Abstract

Hierarchical  reinforcement  learning  (HRL)  is  the  study  of  mechanisms  for  exploiting  the 
structure  of  tasks  in  order  to  learn  more  quickly.  By  decomposing  tasks  into  subtasks,  fully 
or  partially  specified  subtask  solutions  can  be  reused  in  solving  tasks  at  higher  levels  of 
abstraction.  The  theory  of  semi-Markov  decision  processes  provides  a  theoretical  basis  for 
HRL.  Several  variant  representational  schemes  based  on  SMDP  models  have  been  studied 
in  previous  work,  all  of  which  are  based  on  the  discrete-time  discounted  SMDP  model.  In 
this  approach,  policies  are  learned  that  maximize  the  long-term  discounted  sum  of  rewards. 

In  this  paper  we  investigate  two  formulations  of  HRL  based  on  the  average-reward 
SMDP  model,  both  for  discrete  time  and  continuous  time.  In  the  average-reward  model, 
policies  are  sought  that  maximize  the  expected  reward  per  step.  The  two  formulations 
correspond  to  two  different  notions  of  optimality  that  have  been  explored  in  previous  work 
on  HRL:  hierarchical  optimality,  which  corresponds  to  the  set  of  optimal  policies  in  the 
space  defined  by  a  task  hierarchy,  and  a  weaker  local  model  called  recursive  optimality. 
What  distinguishes  the  two  models  in  the  average  reward  framework  is  the  optimization 
of  subtasks.  In  the  recursively  optimal  framework,  subtasks  are  treated  as  continuing,  and 
solved  by  finding  gain  optimal  policies  given  the  policies  of  their  children.  In  the  hierarchical 
optimality  framework,  the  aim  is  to  find  a  globally  gain  optimal  policy  within  the  space 
of  policies  defined  by  the  hierarchical  decomposition.  We  present  algorithms  that  learn  to 
find  recursively  and  hierarchically  optimal  policies  under  discrete-time  and  continuous-time 
average  reward  SMDP  models. 

We  use  four  experimental  testbeds  to  study  the  empirical  performance  of  our  proposed 
algorithms.  The  first  two  domains  are  relatively  simple,  and  include  a  small  autonomous 
guided  vehicle  (AGV)  scheduling  problem  and  a  modified  version  of  the  well-known  Taxi 
problem.  The  other  two  domains  are  larger  real-world  single-agent  and  multiagent  AGV 
scheduling  problems.  We  model  these  AGV  scheduling  tasks  using  both  discrete-time  and 
continuous-time  models  and  compare  the  performance  of  our  proposed  algorithms  with 
each other, as well as with other HRL methods and with standard Q-learning. In the large
AGV domain, we also show that our proposed algorithms outperform widely used industrial
heuristics, such as “first come first serve”, “highest queue first” and “nearest station first”.


Keywords: Hierarchical Reinforcement Learning, Semi-Markov Decision Processes, Average
Reward, Hierarchical and Recursive Optimality.

1. Introduction

Reinforcement  learning  (RL)  (Bertsekas  and  Tsitsiklis,  1996,  Sutton  and  Barto,  1998)  is 
the  study  of  algorithms  that  enable  agents  embedded  in  stochastic  environments  to  learn 
what  actions  to  take  in  different  situations  or  states  in  order  to  maximize  a  scalar  feedback 
function  or  reward  over  time.  The  mapping  from  states  to  actions  is  referred  to  as  a  policy 
or  a  closed-loop  plan.  Learning  occurs  based  on  the  idea  that  the  tendency  to  produce 
an  action  should  be  reinforced  if  it  produces  favorable  long-term  rewards,  and  weakened 
otherwise.  From  the  perspective  of  control  theory,  RL  algorithms  can  be  shown  to  be 
approximations to classical approaches to solving optimal control problems. The classical
approaches use dynamic programming (DP) (Bertsekas, 1995, Puterman, 1994), which
requires perfect knowledge of the system dynamics and payoff function. Reinforcement
learning has the advantage of potentially being able to find optimal (or close-to-optimal)
solutions in domains where models are unknown or unavailable.

Broadly  speaking,  the  two  main  approaches  to  RL  are  to  search  the  policy  space  directly 
using  the  gradient  of  the  parametric  representation  of  the  policy  with  respect  to  some 
performance  metric  -  the  so-called  policy  gradient  formulation  (Marbach,  1998,  Baxter 
and  Bartlett,  2001)  -  or,  instead,  to  learn  an  indirect  target  function,  referred  to  as  the 
value  function  as  it  represents  the  long-term  payoff  associated  with  states  or  state-action 
pairs.  The  policy  can  be  recovered  from  a  value  function  by  choosing  “greedy”  actions  that 
maximize the value of successor states (or of immediate state-action pairs). In this paper, we
focus  on  the  value  function-based  approach,  although  we  have  recently  begun  to  investigate 
hierarchical  policy  gradient  RL  algorithms  as  well  (Ghavamzadeh  and  Mahadevan,  2003). 

The asymptotic convergence of value function-based RL algorithms, such as Q-learning
(Watkins, 1989) or TD(λ) (Sutton, 1988), is only assured in restricted cases, typically when
the values are represented explicitly for each state. Often, real-world problems require using
function approximators for which convergence is not guaranteed in general. In such
cases, convergence is guaranteed only if the value function is approximated using a linear
superposition of basis feature values, and samples are generated using an on-policy
distribution. However, even if asymptotic convergence were theoretically guaranteed by adhering
to these restrictions, in practice these algorithms can take hundreds of thousands of epochs
to converge.

The central focus of this paper is to present new algorithms for reinforcement learning,
applicable to discrete-time and continuous-time continuing tasks, that converge
much more rapidly than traditional Q-learning (Watkins, 1989). The new
algorithms  are  based  on  extending  hierarchical  reinforcement  learning  (HRL),  a  general 
framework  for  scaling  reinforcement  learning  to  problems  with  large  state  spaces  by  using 
the  task  (or  action)  structure  to  restrict  the  space  of  policies.  The  key  principle  underlying 
HRL  is  to  develop  learning  algorithms  that  do  not  need  to  learn  policies  from  scratch, 
but  instead  reuse  existing  policies  for  simpler  subtasks  (or  macro  actions).  The  difficulty 
with  using  the  traditional  framework  for  reusing  learned  policies  is  that  decision  making  no 
longer  occurs  in  synchronous  unit-time  steps,  as  is  traditionally  assumed  in  RL.  Instead, 
decision-making  occurs  in  epochs  of  variable  length,  such  as  when  a  distinguishing  state  is 
reached  (e.g.,  an  intersection  in  a  robot  navigation  task),  or  a  subtask  is  completed  (e.g., 
the elevator arrives on the first floor).

Fortunately, a well-known statistical model is available to treat variable-length actions:
the semi-Markov decision process (SMDP) model (Howard, 1971, Puterman, 1994). Here,
state transition dynamics are specified not only by the state where an action was taken, but
also by parameters specifying the length of time since the action was taken. Early work in RL
on  the  SMDP  model  studied  extensions  of  algorithms  such  as  Q-learning  (Bradtke  and  Duff, 
1995,  Mahadevan  et  al.,  1997b).  This  early  work  on  SMDP  models  was  then  expanded  to 
include  hierarchical  task  models  over  fully  or  partially  specified  lower  level  subtasks.  The 
options  model  (Sutton  et  al.,  1999)  in  its  simplest  form  studied  how  to  learn  policies  given 
fully  specified  policies  for  executing  subtasks.  The  hierarchical  abstract  machines  (HAMs) 
formulation  (Parr,  1998)  showed  how  hierarchical  learning  could  be  achieved  even  when  the 
policies for lower-level subtasks were only partially specified. Lastly, the MAXQ framework
(Dietterich, 2000) provided a comprehensive framework for hierarchical learning where,
instead of specifying policies for subtasks, the learner is given pseudo-reward functions.
While  a  full  comparison  of  these  variant  approaches  is  beyond  the  scope  of  this  paper,  what 
these  treatments  have  in  common  is  that  they  are  all  based  on  the  discrete-time  discounted 
reward  SMDP  framework. 

The  average-reward  formulation  has  been  shown  to  be  more  appropriate  for  a  wide 
class  of  continuing  tasks.  A  primary  goal  of  continuing  tasks,  including  manufacturing, 
scheduling,  queuing  and  inventory  control,  is  to  find  a  gain  optimal  policy  that  maximizes 
(minimizes)  the  long-run  average  reward  (cost)  over  time.  Although  average  reward  RL 
has  been  extensively  studied,  using  both  the  discrete-time  MDP  model  (Schwartz,  1993, 
Mahadevan,  1996,  Tadepalli  and  Ok,  1996a,  Marbach,  1998,  Van-Roy,  1998)  as  well  as  the 
continuous-time  SMDP  model  (Mahadevan  et  al.,  1997b,  Wang  and  Mahadevan,  1999), 
prior  work  has  been  limited  to  flat  policy  representations. 

In  this  paper,  we  extend  previous  work  on  hierarchical  reinforcement  learning  to  the 
average reward SMDP framework and present discrete-time and continuous-time hierarchical
average reward RL algorithms corresponding to two notions of optimality in HRL:
hierarchical  optimality  and  recursive  optimality.  A  secondary  contribution  of  this  paper  is 
to  illustrate  how  HRL  can  be  applied  to  more  interesting  (and  practical)  domains  than  has 
been  illustrated  previously.  In  particular,  we  focus  on  autonomous  guided  vehicle  (AGV) 
scheduling,  although  our  approach  easily  generalizes  to  other  problems,  such  as  transfer 
line production control (Gershwin, 1994, Wang and Mahadevan, 1999). We use four experimental
testbeds to study the empirical performance of our proposed algorithms. The first
two domains are simple: a small autonomous guided vehicle (AGV) scheduling problem and
a modified version of the Taxi problem (Dietterich, 2000). The other two domains are much
larger  real-world  single-agent  and  multiagent  AGV  scheduling  problems.  We  model  these 
AGV  scheduling  tasks  using  both  discrete-time  and  continuous-time  models  and  compare 
the performance of our proposed algorithms with each other, as well as with the MAXQ
method (Dietterich, 2000) and with standard Q-learning. In the multiagent AGV domain, we
also  show  that  our  proposed  extensions  outperform  widely  used  industrial  heuristics,  such 
as  “first  come  first  serve”,  “highest  queue  first”  and  “nearest  station  first”. 

The rest of this paper is organized as follows. Section (2) describes a framework for
hierarchical reinforcement learning which is used to develop the algorithms of this paper. In
Section (3), we present two discrete-time and two continuous-time hierarchical average
reward RL algorithms. Section (3.1) reviews average reward discrete-time SMDPs. In Section
(3.3), we present discrete-time and continuous-time hierarchically optimal average reward
RL algorithms. In Section (3.4), we investigate different methods to formulate subtasks in
a recursively optimal hierarchical average reward RL framework and present discrete-time
and continuous-time recursively optimal hierarchical average reward RL algorithms. Section
(4) provides a brief overview of the Automated Guided Vehicle (AGV) scheduling problem,
which is used in the experimental study presented in this paper. Section (5) presents
experimental results of using the proposed algorithms in a simple AGV scheduling problem, a
modified version of the Taxi problem and large real-world single-agent and multiagent AGV
scheduling problems. Section (6) summarizes the paper and discusses some directions for
future work. Finally, we list the notation used in this paper in Appendix A.


2.  A  Framework  for  Hierarchical  Reinforcement  Learning 

In  this  section,  we  introduce  a  general  hierarchical  reinforcement  learning  framework  for 
simultaneous  learning  at  multiple  levels  of  the  hierarchy.  Our  treatment  builds  upon  the 
existing  approaches,  including  the  MAXQ  value  function  decomposition  (Dietterich,  2000), 
hierarchies  of  abstract  machines  (HAMs)  (Parr,  1998),  and  the  options  model  (Sutton  et  al., 
1999).  We  describe  the  common  principles  underlying  these  variant  formulations  below, 
ignoring  some  of  the  subtle  differences  between  the  frameworks.  In  the  next  section,  we 
will extend this framework to the average reward model and present our hierarchical average
reward  reinforcement  learning  algorithms. 

2.1  Motivating  Example 

Hierarchical reinforcement learning methods provide a general framework for scaling
reinforcement learning to problems with large state spaces by using the task structure to
restrict  the  space  of  policies.  In  these  methods,  the  designer  of  the  system  uses  his/her 
domain  knowledge  to  recursively  decompose  the  overall  task  into  a  collection  of  subtasks 
that  he/she  believes  are  important  for  solving  the  problem. 


A: Agent
T1: Location of one trash can
T2: Location of another trash can
Dump: Location for depositing all trash


Figure 1: A (simulated) robot trash collection task and its associated task graph.

Let us illustrate the idea using the simple search task shown in Figure (1). Consider
the case where, in an office-type environment (rooms and connecting corridors), a robot is
assigned the task of picking up trash from trash cans (T1 and T2) over an extended area
and accumulating it in one centralized trash bin (Dump), from where it might be sent
for recycling or disposed of. The main subtasks in this problem are root (the whole trash
collection task), collect trash at T1 and T2, and navigate to T1, T2 and Dump. Each of these
subtasks is defined by a set of termination states, and terminates when it reaches one of its
termination states. After defining subtasks, we must indicate, for each subtask, which other
subtasks or primitive actions it should employ to reach its goal. For example, navigate to
T1, T2 and Dump use three primitive actions: find wall, align with wall and follow wall.
Collect trash at T1 uses two subtasks, navigate to T1 and Dump, plus two primitive actions,
Put and Pick, and so on.

All  of  this  information  can  be  summarized  by  a  directed  acyclic  graph  called  the  task 
graph.  The  task  graph  for  the  trash  collection  problem  is  shown  in  Figure  (1).  A  key 
challenge  for  any  HRL  method  is  how  to  support  temporal  abstraction,  state  abstraction 
and  subtask  sharing. 

•  Temporal Abstraction: The process of navigating to T1 is a temporally extended
action that can take different lengths of time to complete depending on the distance
to T1.

•  State Abstraction: While the agent is moving toward the Dump, the status of trash
cans T1 and T2 is irrelevant and cannot affect this navigation process. Therefore,
the variables defining the status of trash cans T1 and T2 can be removed from the
state space of the navigate to Dump subtask.

•  Subtask Sharing: If the system could learn how to solve the navigate to Dump
subtask once, then the solution could be shared by both the collect trash at T1 and T2
subtasks.
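
The task graph described above can be written down as a small adjacency map. The following Python sketch is illustrative only: the node names come from the text, the children of collect trash at T2 are assumed by symmetry with T1, and the helper names (`parents`, `shared_subtasks`, `is_acyclic`) are hypothetical.

```python
# Task graph for the trash collection example: each subtask maps to the
# child subtasks or primitive actions it may invoke. The children of
# "collect trash at T2" are an assumption made by symmetry with T1.
NAV_PRIMITIVES = ["find wall", "align with wall", "follow wall"]
TASK_GRAPH = {
    "root": ["collect trash at T1", "collect trash at T2"],
    "collect trash at T1": ["navigate to T1", "navigate to Dump", "Pick", "Put"],
    "collect trash at T2": ["navigate to T2", "navigate to Dump", "Pick", "Put"],
    "navigate to T1": NAV_PRIMITIVES,
    "navigate to T2": NAV_PRIMITIVES,
    "navigate to Dump": NAV_PRIMITIVES,
}


def parents(graph):
    """Map every node to the sorted list of subtasks that can invoke it."""
    result = {}
    for task, children in graph.items():
        for child in children:
            result.setdefault(child, []).append(task)
    return {node: sorted(ps) for node, ps in result.items()}


def shared_subtasks(graph):
    """Non-primitive subtasks with more than one parent (subtask sharing)."""
    return sorted(
        node for node, ps in parents(graph).items()
        if node in graph and len(ps) > 1
    )


def is_acyclic(graph):
    """Depth-first check that the task graph is a directed acyclic graph."""
    status = {}  # node -> "visiting" or "done"

    def visit(node):
        if status.get(node) == "done":
            return True
        if status.get(node) == "visiting":
            return False  # back edge: a cycle was found
        status[node] = "visiting"
        ok = all(visit(child) for child in graph.get(node, []))
        status[node] = "done"
        return ok

    return all(visit(node) for node in graph)
```

Under these assumptions, `shared_subtasks(TASK_GRAPH)` singles out navigate to Dump, the subtask that the Subtask Sharing item refers to.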

2.2  Temporal  Abstraction  using  SMDPs 

Hierarchical RL studies how lower-level policies over subtasks or primitive actions can themselves
be composed into higher level policies. Policies over primitive actions are “semi-Markov”
when composed at the next level up, because they can take a variable, stochastic
amount of time. Thus, semi-Markov decision processes (SMDPs) have become the preferred
language  for  modeling  temporally  extended  actions  (Mahadevan  et  al.,  1997a).  We  briefly 
explain  the  basic  SMDP  model  here,  leaving  details  of  the  average-reward  formulation  to 
later  sections.  Semi-Markov  decision  processes  extend  the  MDP  model  in  several  aspects. 
Decisions are only made at discrete points in time. The state of the system may change continually
between decisions, unlike MDPs where state changes are only due to actions. Thus,
the  time  between  transitions  may  be  several  time  units  and  can  depend  on  the  transition 
that  is  made. 

An SMDP is defined as a four-tuple $(S, A, P, R)$, where $S$ is a finite set of states, $A$
is the set of actions, $P : S \times \mathbb{N} \times S \times A \to [0, 1]$ is a set of state- and action-dependent
multi-step transition probabilities, and $R$ is the reward function. $P(s', N \mid s, a)$ denotes the
probability that action $a$ will cause the system to transition from state $s$ to state $s'$ in $N$
time steps. This transition is at decision epochs only. Basically, the SMDP model represents
snapshots of the system at decision points, whereas the so-called natural process describes
the evolution of the system over all times.

While  SMDP  theory  provides  the  theoretical  underpinnings  of  temporal  abstraction  by 
allowing  for  actions  that  take  varying  amounts  of  time,  the  SMDP  model  provides  little  in 
the  way  of  concrete  representational  guidance  which  is  critical  from  a  computational  point 
of  view.  In  particular,  the  SMDP  model  does  not  specify  how  tasks  can  be  broken  up  into 
subtasks,  how  to  decompose  value  functions  etc.  We  examine  these  issues  next. 

Mathematically, a task hierarchy such as the one illustrated above can be modeled by
decomposing the overall task MDP $M$ into a finite set of subtasks $\{M_0, M_1, \ldots, M_n\}$,
where $M_0$ is the root task and solving it solves the entire MDP $M$.

Definition 1: Each non-primitive subtask $i$ ($i$ is not a primitive action) consists of five
components $(S_i, I_i, T_i, A_i, R_i)$:

•  $S_i$ is the state space for subtask $i$. It is described by those state variables that are
relevant to subtask $i$. The range of the state variables describing $S_i$ might be a subset
of their range in $S$ (state abstraction).

•  $I_i$ is the initiation set for subtask $i$. Subtask $i$ can be initiated only in states belonging
to $I_i$.

•  $T_i$ is the set of terminal states for subtask $i$. Subtask $i$ terminates when it reaches
a state in $T_i$. The policy for subtask $i$ can only be executed if the current state $s$
belongs to $(S_i - T_i)$.

•  $A_i$ is the set of actions that can be performed to achieve subtask $i$. These actions can
either be primitive actions from $A$ (the set of primitive actions for MDP $M$), or they
can be other subtasks.

•  $R_i$ is the reward structure inside subtask $i$ and could be different from the reward
function of MDP $M$. Besides the reward of the overall task (MDP $M$), each subtask
$i$ can use additional rewards to guide its local learning (Ng et al., 1999). Additional
rewards are only used inside each subtask and do not propagate to upper levels in the
hierarchy. If the reward structure inside a subtask is different from the reward function
of the overall task, we need to define two types of value functions for the subtask,
internal value functions and external value functions. Internal value functions are
defined based on both the local reward structure of the subtask and the reward of
the overall task, and are only used in learning the subtask. On the other hand, external
value functions are defined only based on the reward function of the overall task and
are propagated to higher levels in the hierarchy to be used in learning the global policy.
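
The five components of Definition 1 map naturally onto a small record type. The Python sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, states are encoded as integers, the corridor example is invented, and combining the overall task reward with the additional local reward by simple addition is an assumption made here for concreteness.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Subtask:
    """Illustrative container for the components (S_i, I_i, T_i, A_i, R_i)."""
    name: str
    states: FrozenSet[int]      # S_i: (abstracted) state space
    initiation: FrozenSet[int]  # I_i: states in which the subtask may start
    terminal: FrozenSet[int]    # T_i: states in which the subtask terminates
    actions: Tuple[str, ...]    # A_i: primitive actions or child subtasks
    extra_reward: Callable[[int, str], float]  # additional local reward in R_i

    def can_initiate(self, s: int) -> bool:
        return s in self.initiation

    def policy_applicable(self, s: int) -> bool:
        # The subtask policy is only executed in states of S_i - T_i.
        return s in self.states and s not in self.terminal

    def internal_reward(self, s: int, a: str, task_reward: float) -> float:
        # Used only for learning inside the subtask (additive form assumed).
        return task_reward + self.extra_reward(s, a)

    def external_reward(self, s: int, a: str, task_reward: float) -> float:
        # Propagated to higher levels: overall task reward only.
        return task_reward


# Hypothetical corridor with cells 0..3 and the Dump at cell 3.
navigate_to_dump = Subtask(
    name="navigate to Dump",
    states=frozenset(range(4)),
    initiation=frozenset({0, 1, 2}),
    terminal=frozenset({3}),
    actions=("find wall", "align with wall", "follow wall"),
    extra_reward=lambda s, a: 0.5 if a == "follow wall" else 0.0,
)
```

Keeping `internal_reward` and `external_reward` as separate methods mirrors the distinction between internal and external value functions: only the latter would feed learning at higher levels.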

Each  primitive  action  a  is  a  primitive  subtask  in  this  decomposition,  such  that  a  is  always 
executable  and  it  terminates  immediately  after  execution.  From  now  on  in  this  paper,  we 
use subtask to refer to non-primitive subtasks.

2.3 Policy Execution

If  we  have  a  policy  for  each  subtask  in  this  model,  it  gives  us  a  policy  for  the  overall  task. 
This  collection  of  policies  is  called  a  hierarchical  policy. 

Definition 2: A hierarchical policy $\mu$ is a set with a policy for each of the subtasks in
the hierarchy: $\mu = \{\mu_0, \ldots, \mu_n\}$.


The hierarchical policy is executed using a stack discipline, similar to ordinary programming
languages. Each subtask policy takes a state and returns the name of a primitive
action to execute or the name of a subtask to invoke. When a subtask is invoked, its name
is pushed onto the Task Stack and its policy is executed until it enters one of its terminal
states. When a subtask terminates, its name is popped off the Task Stack. If any subtask on
the Task Stack terminates, then all subtasks below it in the hierarchy (that is, above it on
the Task Stack) are immediately aborted, and control returns to the subtask that had invoked
the terminated subtask. Hence, at any time, the root subtask is located at the bottom and
the subtask which is currently being executed is located at the top of the Task Stack.
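
The stack discipline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function `execute`, the one-dimensional corridor, the subtask names `go_trash` and `go_dump`, and the primitive action names are all hypothetical, and the sketch assumes that a policy never invokes a subtask in one of that subtask's terminal states.

```python
def execute(policy, terminal, step, state, root="root", max_primitive_steps=8):
    """Run a hierarchical policy with a Task Stack; return final state and trace."""
    stack = [root]  # root at the bottom, currently executing subtask at the top
    trace = []
    for _ in range(max_primitive_steps):
        # If a subtask on the stack has terminated, pop it and abort
        # everything above it on the stack (its invoked descendants).
        for depth, task in enumerate(stack):
            if terminal[task](state):
                del stack[depth:]
                break
        if not stack:  # the root itself terminated
            break
        # Descend: a subtask policy returns a child subtask or a primitive action.
        choice = policy[stack[-1]](state)
        while choice in policy:
            stack.append(choice)
            choice = policy[choice](state)
        state = step(state, choice)
        trace.append((tuple(stack), choice))
    return state, trace


LEFT_END, RIGHT_END = 0, 4  # hypothetical corridor: trash can at 0, Dump at 4


def root_policy(state):
    pos, carrying = state
    if carrying:
        return "put" if pos == RIGHT_END else "go_dump"
    return "pick" if pos == LEFT_END else "go_trash"


POLICY = {
    "root": root_policy,
    "go_trash": lambda s: "left",
    "go_dump": lambda s: "right",
}
TERMINAL = {
    "root": lambda s: False,  # continuing root task
    "go_trash": lambda s: s[0] == LEFT_END,
    "go_dump": lambda s: s[0] == RIGHT_END,
}


def step(state, action):
    """Deterministic primitive dynamics on (position, carrying)."""
    pos, carrying = state
    if action == "left":
        return (max(LEFT_END, pos - 1), carrying)
    if action == "right":
        return (min(RIGHT_END, pos + 1), carrying)
    if action == "pick":
        return (pos, pos == LEFT_END or carrying)
    if action == "put":
        return (pos, carrying and pos != RIGHT_END)
    raise ValueError(action)


final_state, trace = execute(POLICY, TERMINAL, step, state=(2, False))
```

Each trace entry records the Task Stack at the moment a primitive action is issued, so the root appears alone whenever it selects a primitive action directly.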

Under a hierarchical policy $\mu$, we define a multi-step transition probability
$P_i^{\mu} : S_i \times \mathbb{N} \times S_i \to [0, 1]$ for each subtask $i$ in the hierarchy, where $P_i^{\mu}(s', N \mid s)$ denotes the
probability that action $\mu_i(s)$ will cause the system to transition from state $s$ to state $s'$ in $N$
primitive steps. We also define a single-step transition probability function for each subtask
$i$ under hierarchical policy $\mu$ by marginalizing the multi-step transition probability function
$P_i^{\mu}$, as $F_i^{\mu}(s' \mid s) = \sum_{N=1}^{\infty} P_i^{\mu}(s', N \mid s)$. $F_i^{\mu}(s', n \mid s)$ denotes the $n$-step (or abstract) transition
probability from state $s$ to state $s'$ under hierarchical policy $\mu$, where $n$ is the number of
actions taken by subtask $i$, not the number of primitive actions taken in this transition. In
this paper, we use the abstract transition probability $F$ to model state transitions at the
subtask level and the transition probability $P$ to model state transitions at the level of primitive
actions.
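
The marginalization that turns the multi-step probabilities into single-step ones amounts to summing out the duration $N$. A minimal Python sketch, assuming a hypothetical dictionary encoding in which `P[s][(s_next, N)]` holds the probability of reaching `s_next` in `N` primitive steps under the fixed policy:

```python
from collections import defaultdict


def marginalize(P):
    """Sum the multi-step probabilities P[s][(s_next, N)] over N to get F[s][s_next]."""
    F = {}
    for s, outcomes in P.items():
        row = defaultdict(float)
        for (s_next, _n_steps), prob in outcomes.items():
            row[s_next] += prob
        F[s] = dict(row)
    return F


# Hypothetical two-state subtask: from "A" the chosen action reaches "B" in
# 1 or 3 primitive steps, or returns to "A" after 2 primitive steps.
P_EXAMPLE = {
    "A": {("B", 1): 0.5, ("B", 3): 0.25, ("A", 2): 0.25},
    "B": {("A", 2): 1.0},
}
F_EXAMPLE = marginalize(P_EXAMPLE)
```

In this example `F_EXAMPLE["A"]` assigns probability 0.75 to "B" and 0.25 to "A", regardless of how long each transition takes.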

Definition 3: Under a hierarchical policy $\mu$, each subtask $i$ can be modeled by an SMDP
consisting of components $(S_i, A_i, P_i^{\mu}, R_i)$.

2.4  Local  versus  Global  Optimality 

In  the  HRL  framework,  the  designer  imposes  a  hierarchy  on  the  problem  to  incorporate 
prior  knowledge  and  thereby  reduces  the  size  of  the  space  that  must  be  searched  to  find 
a good policy. However, this hierarchy constrains the space of possible policies so that it
may not be possible to represent the optimal policy or its value function, which makes it
impossible to learn the optimal policy. If we cannot learn the optimal policy, the next best
target  would  be  to  learn  the  best  policy  that  is  consistent  with  the  given  hierarchy.  Two 
notions  of  optimality  have  been  explored  in  previous  work  on  hierarchical  reinforcement 
learning: 

Definition 4: Hierarchical optimality is a global optimum consistent with the given
hierarchy. In this form of optimality, the policy for each individual subtask is not necessarily
optimal, but the policy for the entire hierarchy is optimal. The HAMQ HRL algorithm
(Parr, 1998) and the SMDP Q-learning algorithm for a fixed set of options (Sutton et al.,
1999) both converge to a hierarchically optimal policy. More formally, a hierarchically
optimal policy for MDP $M$ is a hierarchical policy which has the best performance among all
policies consistent with the given hierarchy.

Definition  5:  Recursive  optimality  is  a  weaker  but  more  flexible  form  of  optimality  which 
only  guarantees  that  the  policy  of  each  subtask  is  optimal  given  the  policies  of  its  children. 
It  is  an  important  and  flexible  form  of  optimality  because  it  permits  each  subtask  to  learn 
a  locally  optimal  policy  while  ignoring  the  behavior  of  its  ancestors  in  the  hierarchy.  This 
increases  the  opportunities  for  subtask  sharing  and  state  abstraction.  The  MAXQ-Q  HRL 
algorithm (Dietterich, 2000) converges to a recursively optimal policy. More formally, a
recursively optimal policy for MDP $M$ with hierarchical decomposition $\{M_0, M_1, \ldots, M_n\}$ is
a hierarchical policy $\mu = \{\mu_0, \ldots, \mu_n\}$ such that for each subtask $M_i$, the corresponding
policy $\mu_i$ is optimal for the SMDP defined by the tuple $(S_i, A_i, P_i^{\mu}, R_i)$.

2.5  Value  Function  Definitions 

For recursive optimality, the goal is to find a hierarchical policy $\mu = \{\mu_0, \ldots, \mu_n\}$ such that
for each subtask $M_i$ in the hierarchy, the expected cumulative reward of executing policy $\mu_i$
and the policies of all descendants of $M_i$ is maximized. In this case, the value function to
be learned for subtask $i$ under hierarchical policy $\mu$ must contain only the reward received
during the execution of subtask $i$. We call this the projected value function and define it as
follows:

Definition 6: The projected value function of hierarchical policy $\mu$ on subtask $M_i$,
denoted $V^{\mu}(i, s)$, is the expected cumulative reward of executing policy $\mu_i$ and the policies of
all descendants of $M_i$ starting in state $s \in S_i$ until $M_i$ terminates.

The expected cumulative reward outside a subtask is not a part of its projected value
function. This makes the projected value function of a subtask dependent only on the
subtask itself and its descendants.

On the other hand, for hierarchical optimality, the goal is to find a hierarchical policy
that maximizes the expected cumulative reward. In this case, the value function to be
learned for subtask $i$ under hierarchical policy $\mu$ must contain the reward received during
the execution of subtask $i$ and the reward after subtask $i$ terminates. We call this the
hierarchical value function. The hierarchical value function of a subtask includes the expected
reward  outside  the  subtask  and  therefore  depends  on  the  subtask  and  all  its  ancestors  up 
to  the  root  of  the  hierarchy.  In  the  case  of  hierarchical  optimality,  we  need  to  consider  the 
contents  of  the  Task  Stack  as  an  additional  part  of  the  state  space  of  the  problem,  since  a 
subtask  might  be  shared  by  multiple  parents. 

Definition 7: $\Omega$ is the space of possible values of the Task Stack for hierarchy $\mathcal{H}$.

Let us define a joint state space $X$ as the cross product of Task Stack values $\Omega$ and the
states $S$ in hierarchy $\mathcal{H}$. We define the hierarchical value function using state space $X$ as:

Definition 8: A hierarchical value function for subtask $M_i$ in state $x = (\omega, s)$ and under
hierarchical policy $\mu$, denoted $V^{\mu}(i, x)$, is the expected cumulative reward of following
the hierarchical policy $\mu$ starting in state $s \in S_i$ and Task Stack $\omega$.

The current subtask $i$ is a part of the Task Stack $\omega$ and as a result is a part of state
$x$. So we can exclude it from the hierarchical value function notation and write $V^{\mu}(i, x)$ as
$V^{\mu}(x)$. However, we keep the current subtask $i$ as a part of the hierarchical value function
notation to simplify the notation in the following sections.

3.  Hierarchical  Average  Reward  Reinforcement  Learning 

Given the above fundamental principles of HRL, we can now proceed to describe our hierarchical
average reward formulation. We begin with a review of average reward discrete-time
SMDPs.

3.1  Discrete-time  Average  Reward  SMDPs 

The theory of infinite-horizon SMDPs with the average reward criterion is more complex
than that for discounted models (Howard, 1971, Puterman, 1994). To simplify exposition
we assume that for every stationary policy, the embedded Markov chain has a unichain
transition probability matrix.¹ Under this assumption, the expected average reward of
every stationary policy does not vary with the initial state.

For policy $\mu$, state $s \in S$ and number of time steps $N \geq 0$, $V_N^{\mu}(s)$ denotes the expected
total reward generated by the policy $\mu$ up to time step $N$, given the system occupies state
$s$ at time 0, and is defined as

$$V_N^{\mu}(s) = E_s^{\mu}\left\{ \sum_{k=0}^{N-1} r(s_k, a_k) \right\}$$

The average expected reward or gain $g^{\mu}(s)$ for a policy $\mu$ in state $s$ can be defined by taking
the ratio of the expected total reward and the number of decision epochs. The gain $g^{\mu}(s)$
of a policy $\mu$ can be expressed as

$$g^{\mu}(s) = \liminf_{N \to \infty} \frac{E_s^{\mu}\left\{ \sum_{k=0}^{N-1} r(s_k, a_k) \right\}}{N}$$

For unichain MDPs, the gain of any policy is state independent and we can write $g^{\mu}(s) = g^{\mu}$.
For each transition, the expected number of transition steps until the next decision epoch
is defined as

$$y(s, a) = E_s^{a}\{N\} = \sum_{N=0}^{\infty} N \sum_{s' \in S} P(s', N \mid s, a)$$
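
As a small worked instance of the expected duration $y(s, a)$, the following Python sketch evaluates the double sum for a single state-action pair. The function name and the dictionary encoding `P[(s, a)][(s_next, N)]` are assumptions introduced for illustration.

```python
def expected_duration(P, s, a):
    """y(s, a): sum over N and s' of N * P(s', N | s, a)."""
    return sum(n_steps * prob for (_s_next, n_steps), prob in P[(s, a)].items())


# Hypothetical action "go" taken in state "A": it ends in "B" after 1 or 3
# time steps, or back in "A" after 2 time steps.
P_DURATION = {("A", "go"): {("B", 1): 0.5, ("B", 3): 0.25, ("A", 2): 0.25}}

y_A_go = expected_duration(P_DURATION, "A", "go")  # 1*0.5 + 3*0.25 + 2*0.25
```

Here the expected duration is 1.75 time steps, which is the quantity that multiplies the gain in the average-adjusted Bellman equation.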


1. The underlying Markov chain for every stationary policy has a single recurrent class, and a (possibly
empty) set of transient states.

The expected average adjusted sum of rewards $H^{\mu}$ for policy $\mu$ is defined as

$$H^{\mu}(s) = E_s^{\mu}\left\{ \sum_{k=0}^{\infty} \left[ r(s_k, a_k) - g^{\mu}(s_k) \right] \right\} = E_s^{\mu}\left\{ \sum_{k=0}^{\infty} \left[ r(s_k, a_k) - g^{\mu} \right] \right\}$$

The Bellman equation for the average adjusted value function $H^\mu$ can be written as

$$H^\mu(s) = r(s, \mu(s)) - g^\mu y(s, \mu(s)) + \sum_{N, s' \in S} P(s', N \mid s, \mu(s)) H^\mu(s')$$

The average adjusted action value function $L^\mu(s, a)$ represents the average adjusted value
of doing action $a$ in state $s$ once and then following policy $\mu$ subsequently. It is defined as

$$L^\mu(s, a) = r(s, a) - g^\mu y(s, a) + \sum_{N, s' \in S} P(s', N \mid s, a) L^\mu(s', \mu(s'))$$
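As a minimal sketch of the definitions above, the following Python code evaluates one fixed policy on a small SMDP. It assumes that the embedded transition matrix `P`, the expected rewards `r` and the expected durations `y` under that policy are already known; these names, the two-state example and the helper functions are illustrative assumptions, not part of the original formulation. The gain is taken as reward per primitive step (stationary expected reward divided by stationary expected duration), which is the value that makes the Bellman equation for $H^\mu$ solvable, and $H^\mu$ is pinned to zero at a recurrent reference state.

```python
def stationary(P, iters=2000):
    """Stationary distribution of a unichain matrix P via lazy power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
        # Averaging with the previous iterate avoids oscillation on periodic chains.
        pi = [0.5 * a + 0.5 * b for a, b in zip(pi, step)]
    return pi


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [a - f * d for a, d in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P, r, y, ref=0):
    """Return (gain, H) for a fixed policy; H[ref] is fixed to 0 (ref must be recurrent)."""
    n = len(P)
    pi = stationary(P)
    gain = sum(pi[s] * r[s] for s in range(n)) / sum(pi[s] * y[s] for s in range(n))
    # Bellman equation: (I - P) H = r - gain * y, with one row replaced by H[ref] = 0.
    A = [[(1.0 if s == t else 0.0) - P[s][t] for t in range(n)] for s in range(n)]
    b = [r[s] - gain * y[s] for s in range(n)]
    A[ref] = [1.0 if t == ref else 0.0 for t in range(n)]
    b[ref] = 0.0
    return gain, solve(A, b)


# Hypothetical two-state example: state 1 takes three primitive steps and earns nothing.
P = [[0.5, 0.5], [1.0, 0.0]]
r = [2.0, 0.0]
y = [1.0, 3.0]
gain, H = evaluate_policy(P, r, y)
```

In this toy example the stationary distribution is (2/3, 1/3), so the gain works out to 0.8 and the average adjusted value of state 1 relative to state 0 is -2.4.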


3.2  Assumptions 

In  this  paper,  we  consider  continuing  HRL  problems  for  which  the  following  assumptions 
hold. 

Assumption 1 (Continuing Root Task) The root of the hierarchy is a continuing task,
i.e., the root task goes on continually without terminating.

Assumption 2 (Root Task Recurrence) There exists a state $s_0^* \in S_0$ such that, for
every hierarchical policy $\mu$ and for every state $s \in S_0$, we have$^2$

$$\sum_{n=1}^{|S_0|} F_0^\mu(s_0^*, n \mid s) > 0$$

where $n$ is the number of steps at the level of the root task, not the number of primitive actions
as defined in Section (2.3).

Assumption (2) is equivalent to assuming that the underlying Markov chain for every
policy of the root task has a single recurrent class and the state $s_0^*$ is a recurrent state.$^3$
Under this assumption, the balance equations for policy $\mu$

$$\sum_{s=1}^{|S_0|} F_0^\mu(s' \mid s)\, \pi_0^\mu(s) = \pi_0^\mu(s'), \qquad s' = 1, \ldots, |S_0| - 1$$

$$\sum_{s=1}^{|S_0|} \pi_0^\mu(s) = 1 \qquad (1)$$

have a unique solution $\pi_0^\mu = (\pi_0^\mu(1), \ldots, \pi_0^\mu(|S_0|))$. We refer to $\pi_0^\mu$ as the steady state
probability vector of the Markov chain with transition probability $F_0^\mu(s' \mid s)$ and to $\pi_0^\mu(s)$ as
the steady state probability of being in state $s$.

2. Notice that the root task is represented as subtask $M_0$ in the HRL framework described in Section (2).

3. This assumption can be relaxed by assuming that the MDP corresponding to the root task is unichain.
If assumption (2) holds, the gain $g^\mu$ is well defined for every hierarchical policy $\mu$ and
does not depend on the initial state. We have the following relation:

$$g^\mu = \sum_{s \in S_0} \pi_0^\mu(s)\, r(s, \mu_0(s))$$


When assumption (2) holds, we are interested in finding a hierarchical control policy $\mu^*$
which maximizes the gain, i.e.,

$$g^{\mu^*} \geq g^\mu \quad \text{for all } \mu \qquad (2)$$

We refer to a hierarchical policy $\mu^*$ which satisfies condition (2) as a gain optimal policy,
and to $g^{\mu^*}$ as the optimal average reward or the optimal gain.

However, since the policy learned for the root task involves the policies of its children,
the type of optimality achieved at root depends on how we formulate subtasks in the
hierarchy. We already addressed two notions of optimality: hierarchical optimality and recursive
optimality. In Section (3.3), we introduce an algorithm to find a hierarchically gain optimal
policy (a hierarchical policy that has the maximum gain among all hierarchical policies)
(Ghavamzadeh and Mahadevan, 2002). In Section (3.4), we investigate different approaches
for finding a recursively gain optimal policy (a hierarchical policy in which the policy at each
node has the maximum gain given the policies of its children) and introduce a recursively
gain optimal average reward HRL algorithm.

3.3  Hierarchically  Gain  Optimal  Average  Reward  RL  Algorithm 

In this section, we consider problems for which assumptions (1) and (2) (Continuing Root
Task and Root Task Recurrence) hold, i.e., the average reward for root (the overall problem)
is well defined for every hierarchical policy and does not vary with the initial state. We use the
hierarchical RL framework described in Section (2). Since we are interested in finding the
hierarchically optimal policy, we include the contents of the Task Stack as a part of the state
space of the problem. We also replace the value function and action-value function with the average
adjusted value function and average adjusted action-value function in the hierarchical model
of Section (2).

The hierarchical average adjusted value function $H$ for hierarchical policy $\mu$ and subtask
$i$, denoted $H^\mu(i, x)$, is the average adjusted sum of rewards earned by following policy $\mu$
starting in state $x = (\omega, s)$ until $i$ terminates, plus the expected average adjusted reward
outside subtask $i$:

$$H^\mu(i, x) = \lim_{N \to \infty} E_x^\mu \left\{ \sum_{k=0}^{N-1} \big( r(x_k, a_k) - g^\mu \big) \right\} \qquad (3)$$

where $g^\mu$ is the gain of the root task, which we call the global gain of the hierarchical policy $\mu$.

Now let us suppose that the first action chosen by $\mu$ is invoked and executed for a number
of primitive steps $N_1$ and terminates in state $x_1 = (\omega, s_1)$ according to $P_i^\mu(x_1, N_1 \mid x, \mu_i(x))$,
and after that subtask $i$ itself executes for $n_2$ steps at the level of subtask $i$ ($n_2$ is the
number of actions taken by subtask $i$, not the number of primitive actions) and terminates
in state $x_2 = (\omega, s_2)$ according to abstract transition probability $F_i^\mu(x_2, n_2 \mid x_1)$. We can
write Equation (3) in the form of a Bellman equation:

$$\begin{aligned} H^\mu(i, x) = {} & r(x, \mu_i(x)) - g^\mu y_i(x, \mu_i(x)) + \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 \mid x, \mu_i(x))\, \hat{H}^\mu(i, x_1) \\ & + \sum_{s_1 \in S_i} F_i^\mu(x_1 \mid x, \mu_i(x)) \sum_{n_2, s_2 \in S_i} F_i^\mu(x_2, n_2 \mid x_1)\, H^\mu(Parent(i), (\omega \nearrow i, s_2)) \end{aligned} \qquad (4)$$

where $\hat{H}^\mu(i, \cdot)$ is the projected average adjusted value function of hierarchical policy $\mu$ on
subtask $i$, and $\omega \nearrow i$ is the content of the Task Stack after popping subtask $i$ off. Notice
that $\hat{H}$ does not contain the reward outside the current subtask and should be distinguished
from the hierarchical average adjusted value function $H$, which includes the sum of rewards
outside the current subtask.

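Since $r(x, \mu_i(x))$ is the expected total reward between two decision epochs of subtask
$i$, given that the system occupies state $x$ at the first decision epoch and the decision maker
chooses action $\mu_i(x)$, we have

$$r(x, \mu_i(x)) = \hat{V}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) = \hat{H}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) + g^\mu y_i(x, \mu_i(x))$$

where $\mu_i(x) \searrow \omega$ is the content of the Task Stack after pushing subtask $\mu_i(x)$ onto it. By
replacing $r(x, \mu_i(x))$ from the above expression, Equation (4) can be written as

$$\begin{aligned} H^\mu(i, x) = {} & \hat{H}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) + \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 \mid x, \mu_i(x))\, \hat{H}^\mu(i, x_1) \\ & + \sum_{s_1 \in S_i} F_i^\mu(x_1 \mid x, \mu_i(x)) \sum_{n_2, s_2 \in S_i} F_i^\mu(x_2, n_2 \mid x_1)\, H^\mu(Parent(i), (\omega \nearrow i, s_2)) \end{aligned} \qquad (5)$$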

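We can re-state Equation (5) for the hierarchical average adjusted action-value function as

$$\begin{aligned} L^\mu(i, x, a) = {} & \hat{H}^\mu(a, (a \searrow \omega, s)) + \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 \mid x, a)\, \hat{H}^\mu(i, x_1) \\ & + \sum_{s_1 \in S_i} F_i^\mu(x_1 \mid x, a) \sum_{n_2, s_2 \in S_i} F_i^\mu(x_2, n_2 \mid x_1)\, L^\mu(Parent(i), (\omega \nearrow i, s_2), \mu_{Parent(i)}(\omega \nearrow i, s_2)) \end{aligned}$$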

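and we can re-express the definition for $\hat{H}$ as

$$\hat{H}^\mu(i, s) = \begin{cases} \hat{L}^\mu(i, s, \mu_i(s)) & \text{if } i \text{ is a composite action} \\ \sum_{s'} P(s' \mid s, i)\, [r(s' \mid s, i) - g^\mu] & \text{if } i \text{ is a primitive action} \end{cases} \qquad (6)$$

where $\hat{L}$ is the projected average adjusted action-value function.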
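The augmented state $x = (\omega, s)$ used in this section pairs the Task Stack contents with the environment state. A minimal sketch of that representation is given below; the class name `AugState`, the helper names `push` and `pop`, and the task names are hypothetical and chosen only for illustration. The tuple is hashable, so it can be used directly as a key in tabular value functions.

```python
from typing import NamedTuple, Tuple


class AugState(NamedTuple):
    """Augmented state x = (omega, s): Task Stack contents plus environment state."""
    stack: Tuple[str, ...]  # Task Stack, top of the stack stored last
    s: int                  # environment state


def push(task: str, x: AugState) -> AugState:
    """Stack contents after pushing `task` (the 'task searrow omega' operation)."""
    return AugState(x.stack + (task,), x.s)


def pop(x: AugState, task: str) -> AugState:
    """Stack contents after popping `task` off (the 'omega nearrow task' operation)."""
    if not x.stack or x.stack[-1] != task:
        raise ValueError("task is not on top of the Task Stack")
    return AugState(x.stack[:-1], x.s)


x = AugState(("root",), 3)
x_child = push("navigate", x)        # state seen by the invoked child subtask
table = {x_child: 0.0}               # augmented states can index a value table
```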

The above formulas can be used to obtain update equations for $\hat{H}$, $L$ and $\hat{L}$ in this
framework. Pseudo-code for the resulting algorithm is shown in Algorithm (1). After running
for an appropriate time, this algorithm should generate a gain-optimal policy that maximizes
the average reward for the overall task. In this algorithm, primitive subtasks update only
their projected average adjusted value functions $\hat{H}$ (line 5), while non-primitive subtasks
update both their projected average adjusted action-value functions $\hat{L}$ and hierarchical
average adjusted action-value functions $L$ (lines 17 and 18). We store only one global gain $g$
and update it after each non-random primitive action (line 7). In the update formulas at lines
17 and 18, the projected average adjusted value function $\hat{H}(a, (a \searrow \omega, s))$ is the reward of
executing action $a$ in state $(\omega, s)$ under subtask $i$ and is recursively calculated by subtask
$a$ and its descendants using Equation (6).

Algorithm 1 The discrete-time hierarchically gain optimal average reward RL algorithm.

1: Function HO-AR(Task $i$, State $x = (\omega, s)$)
2: let Seq = {} be the sequence of states visited while executing $i$
3: if $i$ is a primitive action then
4:     execute action $i$ in state $x$, observe state $x' = (\omega, s')$ and reward $r(s' \mid s, i)$
5:     $\hat{H}_{t+1}(i, x) \leftarrow (1 - \alpha_t)\hat{H}_t(i, x) + \alpha_t [r(s' \mid s, i) - g_t]$
6:     if $i$ and all its ancestors are non-random actions then
7:         update the global average reward $g_{t+1} = \frac{r_{t+1}}{n_{t+1}} = \frac{r_t + r(s' \mid s, i)}{n_t + 1}$
8:     end if
9:     push state $x_1 = (\omega \nearrow i, s)$ into the beginning of Seq
10: else
11:     while $i$ has not terminated do
12:         choose action $a$ according to the current exploration policy $\mu_i(x)$
13:         let ChildSeq = HO-AR($a$, $(a \searrow \omega, s)$), where ChildSeq is the sequence of states visited
            while executing action $a$
14:         observe result state $x' = (\omega, s')$
15:         let $a^* = \arg\max_{a' \in A_i(s')} L_t(i, x', a')$
16:         for each $x = (\omega, s)$ in ChildSeq from the beginning do
17:             $\hat{L}_{t+1}(i, x, a) \leftarrow (1 - \alpha_t)\hat{L}_t(i, x, a) + \alpha_t [\hat{H}_t(a, (a \searrow \omega, s)) + \hat{L}_t(i, x', a^*)]$
18:             $L_{t+1}(i, x, a) \leftarrow (1 - \alpha_t)L_t(i, x, a) + \alpha_t [\hat{H}_t(a, (a \searrow \omega, s)) + L_t(i, x', a^*)]$
19:             replace state $x = (\omega, s)$ with $x_1 = (\omega \nearrow i, s)$ in the ChildSeq
20:         end for
21:         append ChildSeq onto the front of Seq
22:         $x = x'$
23:     end while
24: end if
25: return Seq
26: end HO-AR


This algorithm can be easily extended to continuous time by changing the update
formulas for $\hat{H}$ and $g$ in lines 5 and 7 as

$$\hat{H}_{t+1}(i, x) \leftarrow (1 - \alpha_t)\hat{H}_t(i, x) + \alpha_t [k(s, i) + r(s' \mid s, i)\, \tau(s' \mid s, i) - g_t\, \tau(s' \mid s, i)]$$

$$g_{t+1} = \frac{r_{t+1}}{t_{t+1}} = \frac{r_t + k(s, i) + r(s' \mid s, i)\, \tau(s' \mid s, i)}{t_t + \tau(s' \mid s, i)}$$

where $\tau(s' \mid s, i)$ is the time elapsing between states $s$ and $s'$, $k(s, i)$ is the fixed reward of
taking action $i$ in state $s$, and $r(s' \mid s, i)$ is the reward rate for the time that the natural
process remains in state $s'$ between decision epochs.
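The primitive-action updates at lines 5 and 7 of Algorithm (1), and their continuous-time counterparts, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the class `GlobalGain`, the function `primitive_step`, the `greedy` flag (standing in for the test that the action and all its ancestors are non-random), and the numeric values are hypothetical. The discrete-time case is recovered by passing the one-step reward with `elapsed=1`; the continuous-time case passes $k + r\tau$ as the reward and $\tau$ as the elapsed time.

```python
from collections import defaultdict


class GlobalGain:
    """Running estimate of the global gain: total reward divided by total time."""

    def __init__(self):
        self.total_reward = 0.0
        self.total_time = 0.0

    def value(self):
        return self.total_reward / self.total_time if self.total_time > 0 else 0.0

    def update(self, reward, elapsed=1.0):
        self.total_reward += reward
        self.total_time += elapsed


def primitive_step(H, gain, key, reward, alpha, elapsed=1.0, greedy=True):
    """Update H[key] with the current gain estimate, then update the gain."""
    target = reward - gain.value() * elapsed          # average adjusted reward
    H[key] = (1.0 - alpha) * H[key] + alpha * target  # line-5 style update
    if greedy:                                        # skip exploratory steps
        gain.update(reward, elapsed)                  # line-7 style update


H = defaultdict(float)
g = GlobalGain()

# Discrete-time step: reward 2 earned in one time step.
primitive_step(H, g, ("north", 0), reward=2.0, alpha=0.5)

# Continuous-time step: fixed reward k, reward rate `rate`, sojourn time tau.
k, rate, tau = 1.0, 0.5, 4.0
primitive_step(H, g, ("north", 0), reward=k + rate * tau, alpha=0.5, elapsed=tau)
```

Keeping the gain as a ratio of two running totals means the same code path serves both time models; only the reward and elapsed-time arguments change.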
3.4 Recursively Gain Optimal Average Reward RL


In the previous section, we introduced hierarchically gain optimal average reward RL
algorithms for both discrete and continuous time problems. In the proposed average reward
algorithms, we define only a global gain for the entire hierarchy to guarantee global
optimality for the overall task. The hierarchical policy has the highest gain among all policies
consistent with the given hierarchy. However, there might exist a subtask whose policy
must be locally suboptimal so that the overall policy becomes optimal.

Recursive optimality is a kind of local optimality in which the policy at each node is
optimal given the policies of its children. The reason to seek recursive optimality rather
than hierarchical optimality is that recursive optimality makes it possible to solve each
subtask without reference to the context in which it is executed. This leaves open the
question of what local optimality criterion should be used for each subtask except root in
the recursively optimal average reward HRL setting.$^4$ One possibility is to simply optimize the
total reward of every subtask in the hierarchy except root. Another possibility, investigated
in (Ghavamzadeh and Mahadevan, 2001), is to treat subtasks as average reward problems
that maximize their gain given the policies of their children. We will describe this approach
in detail later in this section. Finally, the third approach, pursued in (Seri and Tadepalli,
2002), is to optimize subtasks using their expected total relativized reward with respect to
the gain of the overall task (gain of the root task). Seri and Tadepalli (2002) introduce a
model-based algorithm called Hierarchical H-Learning (HH-Learning).
For every subtask, this algorithm learns the action model and maximizes the expected total
average adjusted reward with respect to the gain of the overall task at each state. In their
approach, the projected average adjusted value functions with respect to the gain of the
overall task satisfy the following Bellman equations:

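$$\hat{H}^\mu(i, s) = \begin{cases} r(s' \mid s, i) - g^\mu\, \tau(s' \mid s, i) & \text{if } i \text{ is a primitive action} \\ 0 & \text{if } s \text{ is a goal state for subtask } i \\ \max_{a \in A_i(s)} \big[ \hat{H}^\mu(a, s) + \sum_{N, s' \in S_i} P_i^\mu(s', N \mid s, a)\, \hat{H}^\mu(i, s') \big] & \text{otherwise} \end{cases} \qquad (7)$$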


The first term of the last part of Equation (7), $\hat{H}^\mu(a, s)$, denotes the expected total
average adjusted reward during the execution of subtask $a$, and the second term denotes the
expected total average adjusted reward from then on until the completion of subtask $i$.
Since the expected average adjusted reward after subtask $i$ execution is not a component
of the average adjusted value function, this approach does not necessarily allow for
hierarchical gain optimality (as will be shown in experiments of Section (5)). Moreover, the
policy learned for each subtask using this approach is not context free, because each node
maximizes its relativized reward with respect to the gain of the overall policy. However, this
method finds the hierarchically gain optimal policy when the result distribution invariance
condition holds (Seri and Tadepalli, 2002).

On the other hand, the approach in which subtasks are treated as average reward
problems (Ghavamzadeh and Mahadevan, 2001) might fail to find the hierarchically gain optimal
policy, as shown in (Seri and Tadepalli, 2002). However, the policy learned at each node
using this method maximizes the gain of the node given the policies of its children.
Therefore, it is independent of the context in which it is executed and could be reused in other
hierarchies. We first describe this approach in detail and then introduce an algorithm
for finding recursively optimal average reward policies.

4. As in the previous section, we consider those problems for which assumptions (1) and (2) (Continuing
Root Task and Root Task Recurrence) hold. Thus, for root, the goal is to maximize its gain, given the
policies for its descendants.

In HRL methods, we typically assume that every time a subtask except root is called,
it starts at one of its initial states and terminates at one of its terminal states after a finite
number of steps. Therefore, we make the following assumption for every subtask $i$ in the
hierarchy except root. Under this assumption, each instantiation of a subtask can be
considered as an episode and each subtask as an episodic problem.

Assumption 3 (Subtask Termination) There exists a distinguished state $s_i^* \in S_i$ such
that, for all hierarchical stationary policies $\mu$ and every terminal state $s_{T_i}$, we have

$$F_i^\mu(s_i^* \mid s_{T_i}, \mu_i(s_{T_i})) = 1 \quad \text{and} \quad r_i(s_i^* \mid s_{T_i}, \mu_i(s_{T_i})) = 0$$

and, for all non-terminal states $s \in S_i$ of the subtask, we have

$$F_i^\mu(s_i^* \mid s, \mu_i(s)) = 0$$

and finally, for all states $s \in S_i$, we have

$$F_i^\mu(s_i^*, n \mid s) > 0$$

where $n = |S_i|$ is the number of states in the state space of the subtask.

Although subtasks are episodic problems, when the overall task is continuing, they are
executed an infinite number of times and therefore can be modeled as continuing problems
using the model described in Figure (2). In this model, each subtask $i$ terminates at one of
its terminal states $s_{T_i} \in T_i$. All terminal states transit with probability one and reward zero
to a distinguished state $s_i^*$. Finally, the distinguished state transits with reward zero to one
of the initial states ($\in I_i$) of the subtask. It is important for the validity of this model to
fix the value of the distinguished state equal to zero.

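Under this model, for every hierarchical policy $\mu$, each subtask $i$ in the hierarchy (except
root) can be modeled as a Markov chain with transition probabilities

$$F_{\mathcal{I}_i}^\mu(s' \mid s, \mu_i(s)) = \begin{cases} F_i^\mu(s' \mid s, \mu_i(s)) & s \neq s_i^* \\ \mathcal{I}_i(s') & s = s_i^* \end{cases} \qquad (8)$$

and rewards

$$r_{\mathcal{I}_i}^\mu(s' \mid s, \mu_i(s)) = r_i(s' \mid s, \mu_i(s))$$

where $\mathcal{I}_i$ is a probability distribution on initial states of subtask $i$.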
Figure 2: This figure shows how each subtask in the hierarchy except root can be modeled
as a continuing task. In this figure, F and r are transition probability and reward.


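Let $F_{\mathcal{I}_i}^\mu$ be the transition matrix with entries $F_{\mathcal{I}_i}^\mu(s' \mid s, \mu_i(s))$ and let $\mathcal{F}_{\mathcal{I}_i}$ be the set of
all such transition matrices. We have the following result for all subtasks in the hierarchy
except root.

Theorem 1 Let assumption (3) (Subtask Termination) hold. Then, for every $F_{\mathcal{I}_i}^\mu \in \mathcal{F}_{\mathcal{I}_i}$
and every state $s \in S_i$, we have$^5$

$$\sum_{n=1}^{|S_i|} F_{\mathcal{I}_i}^\mu(s_i^*, n \mid s) > 0$$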

Theorem (1) is equivalent to assuming that the underlying Markov chain for every
hierarchical policy $\mu$ of any subtask $i$ in the hierarchy has a single recurrent class and state $s_i^*$ is
its recurrent state. Under this assumption, for every subtask $i$ in the hierarchy, the balance
equations for every hierarchical policy $\mu$ have a unique solution $\pi_{\mathcal{I}_i}^\mu$ and the average reward
$g_i^\mu$ is well defined and does not depend on the initial state. Using this model, we define
the average reward of subtask $i$ under the hierarchical policy $\mu$ as:

$$g_i^\mu = \sum_{s \in S_i} \pi_{\mathcal{I}_i}^\mu(s)\, r_{\mathcal{I}_i}^\mu(s' \mid s, \mu_i(s))$$

where $\pi_{\mathcal{I}_i}^\mu(s)$ is the steady state probability of being in state $s$ under hierarchical policy $\mu$.
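The construction in Equation (8) can be illustrated with a small Python sketch. It assumes a tabular subtask whose states are indexed 0..n-1 with the distinguished state stored at index `s_star`; the function names, the three-state example and its rewards are hypothetical. One further assumption is that the reward term in the gain is read as the expected one-step reward under the chain, so the resulting gain is measured per step of the continuing chain, including the zero-reward transitions through the distinguished state.

```python
def continuing_chain(F, terminals, init_dist, s_star):
    """Turn an episodic subtask into a continuing Markov chain in the style of Eq. (8)."""
    n = len(F)
    P = [row[:] for row in F]
    for t in terminals:
        # Terminal states move to the distinguished state with probability one.
        P[t] = [1.0 if j == s_star else 0.0 for j in range(n)]
    # The distinguished state restarts the subtask from the initial-state distribution.
    P[s_star] = list(init_dist)
    return P


def stationary(P, iters=5000):
    """Stationary distribution via lazy power iteration (safe for periodic chains)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
        pi = [0.5 * a + 0.5 * b for a, b in zip(pi, step)]
    return pi


def subtask_gain(P, R):
    """Gain as the stationary expectation of the one-step reward R[s][s']."""
    n = len(P)
    pi = stationary(P)
    return sum(pi[s] * P[s][t] * R[s][t] for s in range(n) for t in range(n))


# Hypothetical subtask: state 0 is initial, state 1 is terminal, state 2 is s*.
F = [[0.5, 0.5, 0.0],   # from state 0: stay (reward -1) or finish (reward +4)
     [0.0, 0.0, 0.0],   # terminal row, overwritten by continuing_chain
     [0.0, 0.0, 0.0]]   # distinguished-state row, overwritten by continuing_chain
R = [[-1.0, 4.0, 0.0],
     [0.0, 0.0, 0.0],   # terminal -> s* earns zero reward
     [0.0, 0.0, 0.0]]   # s* -> initial state earns zero reward
P = continuing_chain(F, terminals=[1], init_dist=[1.0, 0.0, 0.0], s_star=2)
gain = subtask_gain(P, R)
```

For this example the stationary distribution is (0.5, 0.25, 0.25), giving a subtask gain of 0.75.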

In the next section, we illustrate the recursively optimal average reward algorithm
using the above formulation. We consider problems for which assumptions (1), (2) and (3)
(Continuing Root Task, Root Task Recurrence and Subtask Termination) hold and every
subtask in the hierarchy except root is modeled as an average reward problem using the
model in Figure (2) and Equation (8), i.e., the average reward for every subtask in the
hierarchy including root is well defined for every policy and does not vary with the initial state.

5. This theorem is a restatement of Lemma 5 on page 34 of Peter Marbach's thesis (Marbach, 1998),
which is applicable to the model described in Figure (2).
3.4.1 Recursively Gain Optimal Average Reward RL Algorithm

In this section, we describe a discrete-time recursively optimal average reward HRL
algorithm.$^6$ Since we are interested in finding a recursively optimal policy, we can exclude the
contents of the Task Stack from the state space of the problem. We also use the hierarchical
model of Section (2) with the projected average adjusted value function and projected average
adjusted action-value function.

We show how the overall projected average adjusted value function (the projected
average adjusted value function of the root task) of a hierarchical policy is decomposed into
a collection of projected average adjusted value functions of individual subtasks in this
algorithm. The projected average adjusted value function of hierarchical policy $\mu$ on subtask
$i$, denoted $\hat{H}^\mu(i, s)$, is the average adjusted (with respect to local gain $g_i^\mu$) sum of rewards
earned by following policy $\mu_i$ (and the policies of all descendants of subtask $i$) starting in
state $s$ until subtask $i$ terminates. Now let us suppose that the first action chosen by $\mu$ is
invoked and executed for a number of primitive steps $N$ and terminates in state $s'$ according
to $P_i^\mu(s', N \mid s)$. We can write the projected average adjusted value function in the form of
a Bellman equation as

$$\hat{H}^\mu(i, s) = r(s, \mu_i(s)) - g_i^\mu y_i(s, \mu_i(s)) + \sum_{N, s' \in S_i} P_i^\mu(s', N \mid s, \mu_i(s))\, \hat{H}^\mu(i, s') \qquad (9)$$

Since $r(s, \mu_i(s))$ is the expected total reward between two decision epochs of subtask $i$, given
that the system occupies state $s$ at the first decision epoch, the decision maker chooses action
$\mu_i(s)$ and the number of time steps until the next decision epoch is defined by $y_i(s, \mu_i(s))$, we
have

$$r(s, \mu_i(s)) = \hat{V}^\mu(\mu_i(s), s) = \hat{H}^\mu(\mu_i(s), s) + g_{\mu_i(s)}^\mu y_i(s, \mu_i(s))$$

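By replacing $r(s, \mu_i(s))$ from the above expression, Equation (9) can be written as

$$\hat{H}^\mu(i, s) = \hat{H}^\mu(\mu_i(s), s) - (g_i^\mu - g_{\mu_i(s)}^\mu)\, y_i(s, \mu_i(s)) + \sum_{N, s' \in S_i} P_i^\mu(s', N \mid s, \mu_i(s))\, \hat{H}^\mu(i, s') \qquad (10)$$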

We can re-state Equation (10) for the projected action-value function as follows:

$$\hat{L}^\mu(i, s, a) = \hat{H}^\mu(a, s) - (g_i^\mu - g_a^\mu)\, y_i(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N \mid s, a)\, \hat{L}^\mu(i, s', \mu_i(s'))$$

In the above equation, the term

$$-(g_i^\mu - g_a^\mu)\, y_i(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N \mid s, a)\, \hat{L}^\mu(i, s', \mu_i(s'))$$

denotes the average adjusted reward of completing subtask $i$ after executing action $a$ in state
$s$. We call this term the completion function and denote it by $C^\mu(i, s, a)$. With this definition,
we can express the average adjusted action-value function $\hat{L}^\mu$ recursively as

$$\hat{L}^\mu(i, s, a) = \hat{H}^\mu(a, s) + C^\mu(i, s, a)$$

and we can re-express the definition for $\hat{H}$ as

$$\hat{H}^\mu(i, s) = \begin{cases} \hat{L}^\mu(i, s, \mu_i(s)) & \text{if } i \text{ is a composite action} \\ \sum_{s'} P(s' \mid s, i)\, [r(s' \mid s, i) - g_i^\mu] & \text{if } i \text{ is a primitive action} \end{cases} \qquad (11)$$

6. The continuous-time recursively optimal average reward HRL algorithm is similar and was introduced
in (Ghavamzadeh and Mahadevan, 2001).


The above formulas can be used to obtain update equations for the $\hat{H}$ and $C$ functions in the
discrete-time recursively optimal average reward model. Pseudo-code for this algorithm is
shown in Algorithm (2). After running for an appropriate time, this algorithm should generate
a recursively gain-optimal policy that maximizes the gain of each subtask given the policies
of its children. In this algorithm, a gain is defined for every subtask in the hierarchy (even
primitive subtasks) and this gain is updated every time the subtask is non-randomly chosen.
Primitive subtasks update their projected average adjusted value functions $\hat{H}$ (line 5) and
gain (line 7), whereas non-primitive subtasks update their completion functions $C$ (line
19) and gain (line 21). The projected average adjusted value function $\hat{H}$ for non-primitive
subtasks used at lines 15 and 19 is recursively calculated using Equation (11).
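The recursive evaluation of the projected value function and the completion-function update at line 19 of Algorithm (2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the tables are plain dictionaries, `children` maps each composite task to its child actions, the policy is taken to be greedy (as in line 15), terminal-state handling is omitted, and all task names and numbers are hypothetical.

```python
from collections import defaultdict


def h_value(i, s, children, H_prim, C):
    """Projected value of task i in state s, following the shape of Equation (11)."""
    if i not in children:  # primitive action: value is stored directly
        return H_prim[(i, s)]
    # Composite task under a greedy policy: best child value plus completion value.
    return max(h_value(a, s, children, H_prim, C) + C[(i, s, a)]
               for a in children[i])


def update_completion(i, s, a, s_next, n_steps, children, H_prim, C, gain, alpha):
    """Line-19 style update of C(i, s, a) after child a ran n_steps from s to s_next."""
    best_next = max(C[(i, s_next, b)] + h_value(b, s_next, children, H_prim, C)
                    for b in children[i])
    # The gain difference between parent and child is charged for every elapsed step.
    target = best_next - (gain[i] - gain[a]) * n_steps
    C[(i, s, a)] = (1.0 - alpha) * C[(i, s, a)] + alpha * target


children = {"root": ["left", "right"]}
H_prim = defaultdict(float, {("left", 0): 1.0, ("right", 0): 2.0})
C = defaultdict(float, {("root", 0, "left"): 0.5})
gain = {"root": 1.0, "left": 0.5, "right": 0.0}

root_value = h_value("root", 0, children, H_prim, C)
update_completion("root", s=1, a="left", s_next=0, n_steps=2,
                  children=children, H_prim=H_prim, C=C, gain=gain, alpha=0.5)
```

Only primitive values and completion values are stored; the value of a composite task is recomputed on demand by descending the hierarchy, which mirrors the recursive use of Equation (11).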


4.  The  AGV  Scheduling  Task 

In this section, we provide a brief overview of the AGV scheduling problem used in the
experiments of this paper. Automated Guided Vehicles (AGVs) are used in flexible
manufacturing systems (FMS) for material handling (Askin and Standridge, 1993). They are
typically used to pick up parts from one location, and drop them off at another location
for further processing. Locations correspond to workstations or storage locations. Loads
which are released at the drop-off point of a workstation wait at its pick up point after the
processing is over, so the AGV is able to take them to the warehouse or some other locations.
The pickup point is the machine or workstation's output buffer. Any FMS system using
AGVs  faces  the  problem  of  optimally  scheduling  the  paths  of  AGVs  in  the  system  (Klein 
and  Kim,  1996).  For  example,  a  move  request  occurs  when  a  part  finishes  at  a  workstation. 
If  more  than  one  vehicle  is  empty,  the  vehicle  which  would  service  this  request  needs  to  be 
selected.  Also,  when  a  vehicle  becomes  available,  and  multiple  move  requests  are  queued, 
a  decision  needs  to  be  made  as  to  which  request  should  be  serviced  by  that  vehicle.  These 
schedules  obey  a  set  of  constraints  that  reflect  the  temporal  relationships  between  activities 
and  the  capacity  limitations  of  a  set  of  shared  resources. 

The  uncertain  and  ever  changing  nature  of  the  manufacturing  environment  makes  it 
virtually  impossible  to  plan  moves  ahead  of  time.  Hence,  AGV  scheduling  requires  dynamic 
dispatching  rules,  which  are  dependent  on  the  state  of  the  system  like  the  number  of  parts 
in  each  buffer,  the  state  of  the  AGV  and  the  process  going  on  at  workstations.  The  system 
performance  is  generally  measured  in  terms  of  the  throughput,  the  on-line  inventory,  the 
AGV  travel  time  and  the  flow  time,  but  the  throughput  is  by  far  the  most  important  factor. 
The  throughput  is  measured  in  terms  of  the  number  of  finished  assemblies  deposited  at 
the  unloading  deck  per  unit  time.  Since  this  problem  is  analytically  intractable,  various 
heuristics  and  their  combinations  are  generally  used  to  schedule  AGVs  (Klein  and  Kim, 
1996).  However,  the  heuristics  perform  poorly  when  the  constraints  on  the  movement  of 
the  AGVs  are  reduced. 

Previously, Tadepalli and Ok (Tadepalli and Ok, 1996b) studied a single-agent AGV
scheduling task using flat average reward reinforcement learning. In the next section, we
study both single-agent and more complex multiagent AGV scheduling tasks and apply the
HRL algorithms described in the previous section to these tasks.

Algorithm 2 The discrete-time recursively optimal average reward HRL algorithm.

1: Function RO-AR(Task $i$, State $s$)
2: let Seq = {} be the sequence of (state-visited, reward) while executing $i$
3: if $i$ is a primitive action then
4:     execute action $i$ in state $s$, receive reward $r(s' \mid s, i)$ and observe state $s'$
5:     $\hat{H}_{t+1}(i, s) \leftarrow (1 - \alpha_t)\hat{H}_t(i, s) + \alpha_t (r(s' \mid s, i) - g_t(i))$
6:     if $i$ and all its ancestors are non-random actions then
7:         update the gain $g_{t+1}(i)$ of subtask $i$
8:     end if
9:     push (state $s$, reward $r(s' \mid s, i)$) onto the front of Seq
10: else
11:     while $i$ has not terminated do
12:         choose action $a$ according to the current exploration policy $\mu_i(s)$
13:         let ChildSeq = RO-AR($a$, $s$), where ChildSeq is the sequence of (state-visited, reward)
            while executing action $a$
14:         observe result state $s'$
15:         let $a^* = \arg\max_{a' \in A_i(s')} [C_t(i, s', a') + \hat{H}_t(a', s')]$
16:         let $N = 0$; $\rho = 0$;
17:         for each $(s, r)$ in ChildSeq from the beginning do
18:             $N = N + 1$; $\rho = \rho + r$;
19:             $C_{t+1}(i, s, a) \leftarrow (1 - \alpha_t)C_t(i, s, a) + \alpha_t [C_t(i, s', a^*) + \hat{H}_t(a^*, s') - (g_t(i) - g_t(a))N]$
20:             if $a$ and all its ancestors are non-random actions then
21:                 update the gain $g_{t+1}(i)$ of subtask $i$
22:             end if
23:         end for
24:         append ChildSeq onto the front of Seq
25:         $s = s'$
26:     end while
27: end if
28: return Seq
29: end RO-AR

5.  Experimental  Results 

The goal of this section is to demonstrate the efficacy of the algorithms proposed in this
paper. We show the type of optimality that they converge to, as well as their
performance and speed compared to other algorithms. We conduct four sets of experiments
in this section. In Section (5.1), we apply five hierarchical RL algorithms to a simple
discrete-time AGV scheduling problem. The advantage of using this simple domain is that
it clearly demonstrates the difference between hierarchically and recursively optimal policies
and the differences between the optimality criteria achieved by these algorithms. In Section
(5.2), we use a modified version of the well-known Taxi problem (Dietterich, 2000). Since
hierarchically and recursively optimal policies are not different in this domain, we just test
two hierarchically optimal algorithms on this problem. Then we turn to more complex
multiagent and single-agent AGV tasks in Sections (5.3) and (5.4) to demonstrate the
performance and speed of the proposed algorithms in complex domains. In Section (5.3), we use
a complex continuous-time multiagent AGV scheduling problem and compare the
performance and speed of our continuous-time recursively gain optimal average reward algorithm
with the continuous-time recursively optimal discounted reward HRL algorithm introduced in
(Ghavamzadeh and Mahadevan, 2001) as well as three widely used industrial AGV
scheduling heuristics. Finally, in Section (5.4), we model a single-agent AGV scheduling task as
discrete and continuous time problems and apply three hierarchical RL algorithms as well
as a flat RL algorithm to both models.


5.1  Simple  AGV  Scheduling  Problem 

In this section, we apply the discrete-time hierarchically gain optimal algorithm (HO-AR)
described in Section (3.3), the discrete-time recursively gain optimal algorithm (RO-AR)
illustrated in Section (3.4.1), and HH-Learning, the algorithm proposed by Seri and Tadepalli
(Seri and Tadepalli, 2002), to a simple AGV scheduling task. We also test MAXQ (a
recursively optimal discounted reward HRL algorithm) (Dietterich, 2000) and a hierarchically
optimal discounted reward RL algorithm (HO-DR) on this task. We derived the HO-DR
algorithm by combining the three-part value function decomposition introduced by Andre and
Russell (Andre and Russell, 2002) and the MAXQ hierarchical task decomposition (Dietterich,
2000). These experimental results clearly demonstrate the difference between hierarchically
and recursively optimal policies and between the optimality criteria achieved by the above
algorithms.

A small AGV domain is depicted in Figure (3). In this domain there are two machines
(M1 and M2) that produce parts to be delivered to corresponding destination stations (G1
and G2). Since machines and destination stations are in two different rooms, the AGV has
to pass one of the two doors (D1 and D2) every time it goes from one room to another.
Part 1 is more important than part 2, therefore the AGV gets a reward of 20 when part
1 is delivered to destination G1 and a reward of 1 when part 2 is delivered to destination G2.
The AGV receives a reward of -1 for all other actions. This task is deterministic and the
state variables are AGV location and AGV status (empty, carry part 1 or carry part 2),
for a total of 26 x 3 = 78 states. In all experiments, we use the task graph shown in
Figure (3) and set the discount factor to 0.99 for discounted reward algorithms. We tried
several discount factors and 0.99 yielded the best performance. Using this task graph,
hierarchically and recursively optimal policies are different. Since delivering part 1 has more
reward than part 2, the hierarchically optimal policy is one in which the AGV always serves
machine M1. In the recursively optimal policy, the AGV switches from serving machine
M1 to serving machine M2 and vice versa. In this policy, the AGV goes to machine M1,
picks up a part of type 1, goes to goal G1 via door D1, drops the part there, then passes
through door D2, goes to machine M2, picks up a part of type 2, goes to goal G2 via door
D2 and then switches again to machine M1, and so on.
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Ml:  Machine  1  M2:  Machine  2 


Dl:  Door  1 


D2:  Door  2 


Gl:  Goal  1 


G2:  Goal  2 


Figure  3:  A  simple  AGV  domain  (left)  and  its  associated  task  graph  (right). 


Among the algorithms we applied to this task, the hierarchically gain optimal average
reward RL (HO-AR) and the hierarchically optimal discounted reward RL (HO-DR)
algorithms find the hierarchically optimal policy, whereas the other algorithms only learn the
recursively optimal policy. Figure (4) shows the throughput of the system for the
above algorithms. In this figure, the throughput of the system is the number of parts
deposited at the destination stations in 10000 time steps, weighted by their reward
((part 1 x 20) + (part 2 x 1)). Each experiment was conducted ten times and the results
were averaged.
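
The reward-weighted throughput defined above can be sketched as follows. This is a minimal illustration, assuming the delivery counts for one 10000-step window are already available; the function name and the example counts are hypothetical and are not results from the paper.

```python
# Rewards for delivering each part type, as stated in the task description.
REWARD_PART1 = 20
REWARD_PART2 = 1


def weighted_throughput(n_part1: int, n_part2: int) -> int:
    """Reward-weighted number of parts delivered in one measurement window."""
    return n_part1 * REWARD_PART1 + n_part2 * REWARD_PART2


# Hypothetical counts, for illustration only (not measurements from the paper).
example = weighted_throughput(n_part1=10, n_part2=5)
print(example)  # 10 * 20 + 5 * 1 = 205
```

Under this measure a policy that only ever delivers part 1 can score higher than one that alternates between the two part types, which is the distinction Figure (4) is meant to expose.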


Figure  4:  This  plot  shows  that  HO-DR  and  HO-AR  algorithms  learn  the  hierarchically 
optimal  policy  while  MAXQ,  RO-AR  and  HH-Learning  only  find  the  recursively 
optimal  policy  for  the  simple  AGV  task. 


5.2  Modified  Taxi  Problem 

In this section, we apply the discrete-time hierarchically gain optimal average reward RL
algorithm (HO-AR) described in Section (3.3) and the discounted reward hierarchically optimal
RL algorithm (HO-DR) to a modified version of the well-known Taxi problem (Dietterich,
2000).

Unlike the original Taxi problem (Dietterich, 2000), the version used in this paper is
a continuing task. A 5-by-5 grid world inhabited by a taxi agent is shown in Figure (5).
There are four stations, marked as B(lue), G(reen), R(ed) and Y(ellow). The taxi starts
in a randomly chosen location and passengers randomly appear at these four stations.
The passenger at each station wishes to be transported to one of the other three stations
(also chosen randomly). The taxi must go to the location of one of the passengers, pick up the
passenger, go to the passenger's destination and drop off the passenger there. Then, passengers
once again randomly appear at the four stations and the task continues. Each navigation action
moves the taxi one cell in the corresponding direction with probability 0.7, and with
probability 0.3 moves it in one of the other three directions, each with probability
0.1. The system performance is measured as the number of passengers dropped
off at their destinations per fixed number of time steps. The state variables in this task
are taxi location, taxi status, status of each station (whether or not a passenger is waiting
at that station), and destination of the passenger at each station, which gives 512,000
states.
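
The noisy navigation model described above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the direction names and function names are hypothetical, and grid boundaries and walls are ignored.

```python
import random

# Direction labels are an assumption; the paper only says there are four.
DIRECTIONS = ("north", "south", "east", "west")


def outcome_distribution(intended: str) -> dict:
    """Probability of each realized direction given the intended one."""
    return {d: (0.7 if d == intended else 0.1) for d in DIRECTIONS}


def sample_move(intended: str, rng: random.Random) -> str:
    """Sample the direction in which the taxi actually moves."""
    dist = outcome_distribution(intended)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)
moves = [sample_move("north", rng) for _ in range(10_000)]
print(moves.count("north") / len(moves))  # empirical frequency, close to 0.7
```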

T: Taxi   B: Blue Station   G: Green Station   R: Red Station   Y: Yellow Station


Figure  5:  The  Taxi  Domain  (left)  and  its  associated  task  graph  (right). 


Figure (6) compares the proposed discrete-time hierarchically gain optimal algorithm
(HO-AR) with the discrete-time hierarchically optimal discounted reward algorithm (HO-DR),
showing the better performance of the average reward algorithm. Each experiment
was conducted ten times and the results were averaged. With the task graph depicted in Figure
(5), the hierarchically and recursively optimal policies coincide for this problem. Hence,
we did not test the recursively optimal algorithms on this domain.

5.3  Multiagent  AGV  Scheduling  Problem  (Continuous-Time  Model) 

In this section, we apply the continuous-time recursively optimal discounted reward HRL
algorithm introduced in (Ghavamzadeh and Mahadevan, 2001) and the continuous-time recursively
gain optimal average reward algorithm (RO-AR) illustrated in Section (3.4.1) to a complex
multiagent AGV scheduling problem, and compare their performance and speed with each
other as well as with several well-known AGV scheduling heuristics.

[Plot for Figure 6. Horizontal axis: time step since start of simulation.]

Figure 6: This plot shows that the HO-AR algorithm works better than the discounted reward
HO-DR (with discount factor 0.9) on the modified taxi problem. We tried several
discount factors and 0.9 yielded the best performance.

We use a modified version of the above two continuous-time algorithms. This modification
makes them well suited to multiagent problems. The most salient feature of this
extension is that the top level (the level immediately below the root) of the hierarchy is
configured to store the completion function values C for joint abstract actions of all agents.
The completion function for agent j, $C^j(i, s, a^1, \ldots, a^j, \ldots, a^n)$, is defined as the
expected (discounted or undiscounted) reward of completion of subtask $a^j$ by agent j in the
context of the other agents performing subtasks $a^i$, $\forall i \neq j$, $i \in \{1, \ldots, n\}$,
where s is the local state of agent j, not the joint state. This method reduces the number of
joint state-action values that need to be learned in a complex multiagent task, and provides a
sufficiently good approximation of the true value functions (see (Makar et al., 2001) for details).
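
A top-level completion table of this kind can be sketched as a dictionary keyed by the parent subtask, the agent's local state, and the joint abstract action. This is a hypothetical data-structure sketch, not the authors' implementation: the class name, the state encoding and the parent name "Root" are assumptions, while DM1 and DA2 are subtask names from the task graph.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

JointAction = Tuple[str, ...]  # one abstract action per agent


class TopLevelCompletion:
    """Completion values C^j(i, s, a^1, ..., a^n) stored by one agent j.

    The key uses the agent's local state s, not the joint state, so the
    table grows with the joint actions but not with the other agents' states.
    """

    def __init__(self) -> None:
        self._c: Dict[Tuple[Hashable, Hashable, JointAction], float] = defaultdict(float)

    def get(self, subtask: Hashable, local_state: Hashable, joint_action: JointAction) -> float:
        return self._c[(subtask, local_state, joint_action)]

    def set(self, subtask: Hashable, local_state: Hashable,
            joint_action: JointAction, value: float) -> None:
        self._c[(subtask, local_state, joint_action)] = value


# Illustrative use with two agents; all values are made up.
table = TopLevelCompletion()
local_state = ("location_7", "empty")
table.set("Root", local_state, ("DM1", "DA2"), 3.5)
print(table.get("Root", local_state, ("DM1", "DA2")))
```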

Figure (7) shows the layout of the AGV scheduling problem used in this experiment.
M1 to M4 denote the workstations in this environment. Parts of type i have to be carried to
the drop-off station at workstation i ($D_i$), and the assembled parts brought back from the
pick-up stations of the workstations (the $P_i$'s) to the warehouse. AGV travel is
unidirectional (as the arrows show).

Each  agent  uses  a  copy  of  the  task  graph  in  Figure  (8).  Learning  is  decentralized,  with 
each  agent  learning  three  interrelated  skills:  how  to  perform  subtasks,  which  order  to  do 
them  in,  and  how  to  coordinate  with  other  agents.  Coordination  skills  among  agents  are 
learned  by  using  joint  actions  at  the  highest  level  of  the  hierarchy  as  described  above. 

Figure 7: A multiagent AGV scheduling task. There are four AGV agents (not shown)
which carry raw materials and finished parts between machines and warehouse.

Figure 8: Task graph for the AGV scheduling task.

The state of the environment consists of the number of parts in the pick-up and drop-off
stations of each machine, and whether the warehouse contains parts of each of the four
types. In addition, each agent keeps track of its own location and status as a part of its
state space. Thus, in the flat case, the state space consists of 100 locations, 8 buffers of size 3,
9 possible states of the AGV (carrying Part1, ..., carrying Assembly1, ..., empty), and 2
values for each part in the warehouse, i.e., $100 \times 4^8 \times 9 \times 2^4 \approx 2^{30}$ states,
which is enormous. State abstraction helps in reducing the state space considerably. Only the
relevant state variables are used while storing the completion functions in each node of the
task graph. For example, for the Navigation subtask, only the location state variable is
relevant, and this subtask can be learned with 100 values. Hence, for the highest level actions
DM1, ..., DM4, the number of relevant states would be $100 \times 9 \times 4 \times 2 \approx 2^{13}$,
and for the highest level actions DA1, ..., DA4, the number of relevant states would be
$100 \times 9 \times 4 \approx 2^{12}$. This state abstraction gives us a compact way of
representing the C functions, and speeds up the algorithm (Dietterich, 2000).
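
The state counts quoted above can be checked with a few lines of arithmetic. The sketch assumes that a buffer of size 3 takes one of four values (0 to 3 parts); the factors 4 and 2 in the abstracted counts are taken from the text as given, and the variable names are illustrative.

```python
import math

locations = 100
buffers, buffer_levels = 8, 4   # assumed: a buffer of size 3 holds 0..3 parts
agv_status = 9
warehouse_flags = 4             # one binary flag per part type

flat_states = locations * buffer_levels**buffers * agv_status * 2**warehouse_flags
dm_states = locations * agv_status * 4 * 2   # highest level actions DM1..DM4
da_states = locations * agv_status * 4       # highest level actions DA1..DA4

for name, n in [("flat", flat_states), ("DM", dm_states), ("DA", da_states)]:
    print(f"{name}: {n} states, log2 = {math.log2(n):.1f}")
```

The flat count is 943,718,400, a little under $2^{30}$, while the abstracted top-level counts are 7200 and 3600, a little under $2^{13}$ and $2^{12}$.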

The experimental results were generated with the following model parameters. The
inter-arrival time for parts at the warehouse is uniformly distributed with a mean of 4 sec
and variance of 1 sec. The percentages of Part1, Part2, Part3 and Part4 in the part arrival
process are 20, 28, 22 and 30 respectively. The time required for assembling the various
parts is normally distributed with means 15, 24, 24 and 30 sec for Part1, Part2, Part3
and Part4 respectively, and variance 2 sec. The execution time of primitive actions
(navigation actions, load and unload) is normally distributed with mean 1000 and variance
50 microseconds. The execution time of the idle action is normally distributed with mean 1 sec
and variance 0.1 sec. Table (1) shows the values of all the parameters of the continuous-time
model used in the experiments of this section. Each experiment was conducted five
times and the results were averaged.


Parameter                      Type of Distribution   Mean (sec)   Variance (sec)
Idle Action                    Normal                 1            0.1
Primitive Actions              Normal                 0.001        0.00005
Assembly Time for Part1        Normal                 15           2
Assembly Time for Part2        Normal                 24           2
Assembly Time for Part3        Normal                 24           2
Assembly Time for Part4        Normal                 30           2
Inter-Arrival Time for Parts   Uniform                4            1

Table 1: Model Parameters
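
Durations with the parameters of Table (1) could be drawn as in the sketch below. Two assumptions are made that the paper does not state: the "variance" column is read literally as a variance (so a uniform distribution with variance v has half-width sqrt(3v)), and normal samples are truncated at zero to keep durations non-negative. The function names are hypothetical.

```python
import math
import random


def sample_normal(mean: float, variance: float, rng: random.Random) -> float:
    """Draw a duration from N(mean, variance), truncated at zero (assumption)."""
    return max(0.0, rng.gauss(mean, math.sqrt(variance)))


def sample_uniform(mean: float, variance: float, rng: random.Random) -> float:
    """Draw from the uniform distribution with the given mean and variance.

    A uniform on [a, b] has variance (b - a)**2 / 12, so the half-width
    of the interval is sqrt(3 * variance).
    """
    half_width = math.sqrt(3.0 * variance)
    return rng.uniform(mean - half_width, mean + half_width)


rng = random.Random(0)
inter_arrival = sample_uniform(4.0, 1.0, rng)    # inter-arrival time for parts
assembly_part1 = sample_normal(15.0, 2.0, rng)   # assembly time for Part1
idle = sample_normal(1.0, 0.1, rng)              # idle action
print(inter_arrival, assembly_part1, idle)
```

If the authors instead meant the half-width or the standard deviation by "variance", only the two helper functions would need to change.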


Figure (9) shows the throughput of the system for the continuous-time recursively optimal
discounted reward HRL algorithm and the continuous-time recursively gain optimal average
reward (RO-AR) algorithm. As seen in this figure, the agents learn a little faster initially with
the discounted reward method, but the final system throughput achieved using the average
reward algorithm is higher than in the discounted reward case. This figure also compares these
two algorithms with several well-known AGV scheduling rules (highest queue first, nearest
station first and first come first serve), showing clearly the improved performance of the
reinforcement learning methods.

[Plot for Figure 9. Horizontal axis: time since start of simulation (sec), from 0 to 60000.
Curves: Highest Queue First heuristic; Nearest Station First heuristic; First Come First Serve
heuristic; continuous-time recursively gain optimal average reward (RO-AR); continuous-time
recursively optimal discounted reward.]

Figure 9: This plot shows that the continuous-time recursively gain optimal average reward
(RO-AR) algorithm outperforms the continuous-time recursively optimal discounted
reward algorithm. It also demonstrates that both of these algorithms outperform three
well-known, widely used (industrial) heuristics for AGV scheduling.


5.4 AGV Scheduling Problem (Discrete and Continuous Time Models)

In this section, we describe two sets of experiments on a modified version of the AGV
scheduling task described in Section (5.3). In these experiments, we assume
a single-agent problem with only three machines in the environment (Figure (7) without
machine M3). This reduces the number of states to 1,347,192. We model this AGV
scheduling task using both discrete-time and continuous-time models, and compare the
performance and speed of three HRL algorithms, hierarchically gain optimal RL (HO-AR),
hierarchically optimal discounted reward RL (HO-DR) and recursively gain optimal RL (RO-AR),
as well as a non-hierarchical average reward algorithm. In both sets of experiments, we
use the task graph for the AGV scheduling task shown in Figure (8) and discount factors
0.9 and 0.95 for the discounted reward algorithms. In both sets of experiments, a discount
factor of 0.95 yielded better performance.

The discrete-time experimental results were generated with the following model parameters.
The inter-arrival time for parts at the warehouse is uniformly distributed with a mean
of 12 time steps and variance of 2 time steps. The percentages of Part1, Part2 and Part3 in
the part arrival process are 40, 35 and 25 respectively. The times required for assembling the
various parts are Poisson random variables with means 6, 10 and 12 time steps for Part1,
Part2 and Part3 respectively, and variance 2 time steps. Table (2) shows the parameters
of the discrete-time model.


Parameter                      Distribution   Mean (steps)   Variance (steps)
Assembly Time for Part1        Poisson        6              2
Assembly Time for Part2        Poisson        10             2
Assembly Time for Part3        Poisson        12             2
Inter-Arrival Time for Parts   Uniform        12             2

Table 2: Parameters of the Discrete-Time Model


The continuous-time experimental results were generated with the following model
parameters. The time required for execution of each primitive action is a normal random
variable with mean 10 sec and variance 2 sec. The inter-arrival time for parts at the
warehouse is uniformly distributed with a mean of 100 sec and variance of 20 sec. The
percentages of Part1, Part2 and Part3 in the part arrival process are 40, 35 and 25
respectively. The times required for assembling the various parts are normal random variables
with means 100, 120 and 180 sec for Part1, Part2 and Part3 respectively, and variance 20
sec. Table (3) contains the parameters of the continuous-time model. In both cases, each
experiment was conducted five times and the results were averaged.


Parameter                              Type of Distribution   Mean (sec)   Variance (sec)
Execution Time for Primitive Actions   Normal                 10           2
Assembly Time for Part1                Normal                 100          20
Assembly Time for Part2                Normal                 120          20
Assembly Time for Part3                Normal                 180          20
Inter-Arrival Time for Parts           Uniform                100          20

Table 3: Parameters of the Continuous-Time Model


Figure (10) compares the proposed discrete-time hierarchically gain optimal algorithm
(HO-AR) described in Section (3.3) with the discrete-time discounted reward hierarchically
optimal algorithm (HO-DR) and the discrete-time recursively gain optimal algorithm (RO-AR)
illustrated in Section (3.4.1). The graph shows the improved performance of the
proposed discrete-time average reward algorithm (HO-AR). This figure also shows that the
HO-AR algorithm converges to the same throughput as the non-hierarchical average
reward algorithm, but does so faster. The non-hierarchical average reward algorithm used in
this experiment is relative value iteration (RVI) Q-learning (Abounadi et al., 2001). The
difference in convergence speed between flat and hierarchical algorithms becomes more
significant as we increase the number of states.
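
For readers unfamiliar with the flat baseline, the sketch below shows RVI Q-learning in its common tabular form, in which the value of a fixed reference state-action pair is subtracted in place of discounting. It is a minimal illustration under stated assumptions, not the experimental code: the two-state MDP, the constant step size, the uniform random exploration and all names are hypothetical.

```python
import random

# Hypothetical 2-state, 2-action deterministic MDP: (s, a) -> (next_state, reward).
MDP = {
    (0, 0): (1, 0.0),
    (0, 1): (1, 1.0),
    (1, 0): (0, 4.0),
    (1, 1): (1, 2.0),
}
REF = (0, 0)  # reference state-action pair; Q[REF] serves as the gain estimate


def rvi_q_learning(steps: int = 50_000, alpha: float = 0.05, seed: int = 0) -> dict:
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in MDP}
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)  # uniform random exploration (off-policy)
        s_next, r = MDP[(s, a)]
        best_next = max(q[(s_next, b)] for b in (0, 1))
        # RVI update: subtract the reference value instead of discounting.
        q[(s, a)] += alpha * (r + best_next - q[REF] - q[(s, a)])
        s = s_next
    return q


q = rvi_q_learning()
print("estimated gain:", q[REF])
```

In this toy problem the best cycle is 0 to 1 (reward 1) and back to 0 (reward 4), with average reward (1 + 4) / 2 = 2.5 per step, so the estimate `q[REF]` should approach 2.5. The hierarchical algorithms in the paper replace the single table with per-subtask value functions; that part is not shown here.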


Figure 10: This plot shows that the discrete-time HO-AR algorithm performs better than
both the discounted reward HO-DR and RO-AR algorithms on the AGV scheduling
task. It also demonstrates the faster convergence of the HO-AR algorithm
compared to the non-hierarchical average reward algorithm (RVI Q-learning).


Figure (11) compares the continuous-time hierarchically gain optimal algorithm (HO-AR)
proposed in Section (3.3) with the continuous-time hierarchically optimal discounted
reward HO-DR algorithm and the continuous-time recursively gain optimal algorithm (RO-AR)
described in Section (3.4.1). The graph shows that HO-AR converges to the
same performance as the discounted reward HO-DR algorithm. Both clearly perform
better than the average reward recursively optimal algorithm (RO-AR). This figure
also shows that the HO-AR algorithm converges to the same throughput as the
non-hierarchical average reward algorithm, but does so faster. The non-hierarchical average
reward algorithm used in this experiment is a continuous-time version of relative value
iteration (RVI) Q-learning (Abounadi et al., 2001). The difference in convergence speed
between flat and hierarchical algorithms becomes more significant as we increase the number
of states.

The results in the last two sections are consistent with the hypothesis that the undiscounted
optimality paradigm is superior to the discounted framework for learning a gain-optimal
policy, since undiscounted methods do not need careful tuning of the discount factor
to find gain-optimal policies.
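
The sensitivity to the discount factor can be made concrete with the standard Laurent-series relation between discounted value and gain for finite unichain MDPs. This relation is a textbook fact rather than a result of this paper, and the symbols $V_\gamma^\mu$, $g^\mu$, $h^\mu$ and $e^\mu$ are notation assumed here for illustration:

```latex
V_\gamma^\mu(s) \;=\; \frac{g^\mu}{1-\gamma} \;+\; h^\mu(s) \;+\; e^\mu(s,\gamma),
\qquad \lim_{\gamma \to 1} e^\mu(s,\gamma) = 0 .
```

For $\gamma$ sufficiently close to 1 the gain term dominates, so a discounted-optimal policy is also gain optimal. For smaller $\gamma$ the remaining terms can change the ranking of policies, which is consistent with the discounted algorithms above needing a different discount factor in each domain.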


Figure 11: This plot shows that the continuous-time HO-AR converges to the same
performance as the discounted reward HO-DR, and both outperform the recursively
optimal average reward algorithm on the AGV scheduling task. It also
demonstrates the faster convergence of HO-AR compared to the flat average reward
algorithm (RVI Q-learning).


6.  Conclusion  and  Future  Work 

This paper presents new discrete-time and continuous-time hierarchical reinforcement learning
algorithms applicable to continuing tasks, including manufacturing, scheduling, queuing
and inventory control. These algorithms are based on the average-reward SMDP model,
which has been shown to be more appropriate for a wide class of continuing tasks. These
hierarchical average-reward reinforcement learning algorithms can be categorized into two
groups corresponding to two notions of optimality that have been studied in previous work
on HRL: hierarchical optimality and recursive optimality. Hierarchically gain optimal
average reward RL algorithms aim to find a globally gain optimal policy within the space of
policies defined by the hierarchical decomposition. In the recursively optimal average reward
HRL setting, the formulation of learning algorithms directly depends on the local optimality
criterion used for each subtask in the hierarchy. In this paper, we investigate several methods
to formulate subtasks; however, the recursively gain optimal average reward RL algorithms
proposed in this paper treat subtasks as continuing problems and solve them by finding
gain optimal policies given the policies of their children. A secondary contribution of this
paper is to illustrate how hierarchical reinforcement learning can be applied to more
interesting and practical domains than has been shown previously. In particular, we focus on
AGV scheduling, although our approach easily generalizes to other industrial optimization
problems such as transfer line production control.

The effectiveness of the proposed algorithms was tested using four experimental testbeds:
a small AGV scheduling domain, a modified version of the Taxi problem, and much larger
real-world single-agent and multiagent AGV domains. The proposed algorithms performed
well in all domains; in particular, in the multiagent AGV domain, we showed that our
proposed algorithms outperform widely used industrial heuristics, such as “first come first
serve”, “highest queue first” and “nearest station first”.

There are a number of directions for future work, which we briefly outline. An
immediate question is the convergence of the algorithms to recursively and hierarchically
optimal policies. Such results would provide theoretical validity for the
proposed methods, in addition to their empirical effectiveness demonstrated in this paper.
Many other manufacturing and robotics problems may also benefit from
these algorithms. Combining hierarchical reinforcement learning with function approximation
and factored action models is an important area for research. In this direction, we
are currently working to develop a hierarchical reinforcement learning framework suitable
for problems with continuous state spaces, using a mixture of policy gradient-based RL
and value function-based RL methods (Ghavamzadeh and Mahadevan, 2003). We used
the flexibility provided by recursive optimality and developed HRL algorithms for learning
in this hierarchical hybrid framework. In this hybrid framework, when the overall task is
continuing, we formulate the root task as a continuing problem using the average reward
criterion. Since the policy learned at the root involves the policies of its children, the type of
optimality achieved at the root depends on how we formulate the other subtasks in the
hierarchy. We are currently investigating the different formulations of average reward
recursive optimality described in this paper in our hierarchical hybrid framework. Finally,
deriving abstractions automatically is another fundamental problem that needs to be addressed.


Appendix A.
List of Notation

Notation                          Definition
$H$                               hierarchy
$\mathbb{N}$                      set of natural numbers
$S$                               set of states
$\Omega$                          set of possible values for Task Stack in hierarchy
$\mathcal{X} = \Omega \times S$   joint state space of Task Stack values and states
$x = (\omega, s)$                 joint state value x formed by Task Stack value $\omega$ and state value s
$\omega \nearrow i$               popping subtask i off the Task Stack with content $\omega$
$i \searrow \omega$               pushing subtask i onto the Task Stack with content $\omega$
$|S|$                             cardinality of the set S
$A$                               set of actions
$P$                               transition probability function
$R$                               reward function
$S_i$                             set of states for subtask i
$A_i$                             set of actions for subtask i
$R_i$                             reward function for subtask i
$I_i$                             initiation set for subtask i
$T_i$                             termination set for subtask i
$s_i^T$                           a terminal state of subtask i, $s_i^T \in T_i$
$\mu_i$                           policy for subtask i
$\mu = \{\mu_0, \ldots, \mu_n\}$  hierarchical policy
$\mu^*$                           hierarchical optimal policy
$P_i^\mu(s', N | s)$              probability that action $\mu_i(s)$ causes transition from state s to state s'
                                  in N primitive steps under the hierarchical policy $\mu$
$F_i^\mu(s', n | s)$              probability of transition from state s to state s' in n actions taken
                                  by subtask i under the hierarchical policy $\mu$
[symbol not recoverable]          gain of policy $\mu$
$y(s, a)$                         expected number of transition steps until the next decision epoch
$\pi^\mu(s)$                      steady state probability of being in state s for Markov chain defined
                                  by policy $\mu$
$\pi^\mu$                         steady state probability vector of the Markov chain defined by policy $\mu$
[symbol not recoverable]          probability distribution on initial states of subtask i
$V^\mu$                           hierarchical value function of hierarchical policy $\mu$
$\hat{V}^\mu$                     projected value function of hierarchical policy $\mu$
$H^\mu$                           hierarchical average adjusted value function of hierarchical policy $\mu$
$\hat{H}^\mu$                     projected average adjusted value function of hierarchical policy $\mu$
$L^\mu$                           hierarchical average adjusted action-value function of hierarchical policy $\mu$
$\hat{L}^\mu$                     projected average adjusted action-value function of hierarchical policy $\mu$
$C^\mu$                           completion function of hierarchical policy $\mu$

Table 4: List of notation used in the paper